|
|
|
== Blackbox Description ==
|
|
|
|
|
|
|
|
One way to improve the reliability of systems that use redundant subsystems is to support fault-tolerance mechanisms that adapt to changes in subsystem reliability during the system's lifetime. For instance, if the ability to recover from errors is exhausted for a particular replicated subsystem because too many permanent errors have accumulated (e.g., one replica of a Triple Modular Redundancy system has failed permanently), appropriate actions have to be taken to restore the reliability (e.g., migrating the replicated system functionality to a different IP core in a multi-core system). This is a prerequisite for the sustained operation of components, which is demanded by applications that require non-stop operation throughout their entire lifetime.
|
|
|
|
In addition, since on-call maintenance can be very cost intensive due to maintenance contracts and service outages, the universAAL architecture shall enable the shift of on-call maintenance to periodic maintenance. A shift to periodic maintenance can be achieved by fault-tolerance techniques that retain, in case of an internal error, the correct system functionality until the next scheduled service date.
|
|
|
|
|
|
|
|
''Ground Rules''
|
|
|
|
|
|
|
|
Creating fault-tolerant behavior in a hardware/software system is a complex process. Faults have diverse sources, ranging from physical failures of the hardware to logic errors in the software, and they may be internal or external. They may be operational errors or the result of malicious use. Faults can be temporary or persistent. Software faults are due to flaws in the design of the system.
|
|
|
|
|
|
|
|
Descriptions of failure scenarios range from the complete loss of power to the failure of individual components. The classification of failures and their consequences is always unique to the particulars of the service provided. Consequently, approaches for achieving a particular level of dependability will vary. Fault prevention in the design phase and fault removal through maintenance are important means of delivering reliable software.
|
|
|
|
|
|
|
|
A precise and rigorous terminology for the basic concepts of dependable computing is used in the literature. It was codified by Laprie <ref>J.C. Laprie (ed.). Dependability: Basic Concepts and Terminology, Dependable computing and fault tolerant systems services, Vol 5. Springer-Verlag 1992</ref>; see also Avizienis, Laprie, and Randell <ref>A. Avizienis, J.-C. Laprie, B. Randell. Fundamental Concepts of Dependability, 2001</ref>.
|
|
|
|
A system is a collection of interacting components that deliver a service through a service interface to a user. The user can be a human operator, or another computer system. The service delivered by a system is its behaviour perceived by the user.
|
|
|
|
Dependability of a computing system is the ability to deliver service that can justifiably be trusted. Applications can emphasize different attributes of dependability, including:
|
|
|
|
*Availability, the readiness for correct service.
|
|
|
|
*Reliability, the continuity of that service.
|
|
|
|
*Safety, the avoidance of catastrophic consequences on the environment.
|
|
|
|
*Security, the prevention of unauthorized access.
|
|
|
|
|
|
|
|
The function of a system is what the system is intended to do, as described by the functional specification. A system failure occurs when the service delivered does not comply with the specification. The system state is the set of the component states.
|
|
|
|
|
|
|
|
*An error is a system state that may lead to failure. An error is detected if an error message or signal is produced within the system or latent if not detected. A fault is the cause of an error, and is active when it results in an error, otherwise it is dormant.
|
|
|
|
|
|
|
|
*Fault tolerance is the ability of a system to deliver correct service in the presence of faults <ref>A. Avizienis. Fault Tolerant Systems, IEEE Trans. Computers Vol C-25 No 12. 1976</ref>. This is achieved by error processing (removing the error state from the system) and by treating the source of the fault. The ability to detect and process error states and to assess their consequences is a critical requirement of fault-tolerant design.
|
|
|
|
Fault tolerance, in both hardware and software, is achieved through some kind of redundancy. Hardware redundancy techniques often make use of multiple identical units, in addition to a means for arbitrating the resulting output. ECC memory, for example, uses a few extra bits to detect and correct errors resulting from faults in the individual storage bits. Software cannot simply be replicated in the same way: running the same input data through a faulty software module multiple times yields the same erroneous result each time. Software fault tolerance is therefore built by applying algorithmic diversity, computing results through independent paths, and by judging the results. This adds complexity to the system in general. Adding software fault tolerance will improve system reliability only if the gains made by the added redundancy are not offset by commensurate new faults introduced by the parallel code.
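The voting idea behind replicated hardware units and diverse software variants can be illustrated with a minimal sketch. The names <code>vote</code> and <code>run_redundant</code> and the three toy variants are assumptions made for illustration only; they are not part of universAAL.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def vote(outputs: Sequence[int], replicas: int) -> Optional[int]:
    """Return the value agreed on by a strict majority of all replicas, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > replicas // 2 else None


def run_redundant(variants: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Run independently written variants on the same input and judge the results."""
    outputs = []
    for variant in variants:
        try:
            outputs.append(variant(x))
        except Exception:
            # A crashed variant simply contributes no vote.
            pass
    return vote(outputs, len(variants))


# Hypothetical variants: two diverse, correct implementations and one faulty one.
def square_mul(x: int) -> int:
    return x * x


def square_sum(x: int) -> int:
    return sum(abs(x) for _ in range(abs(x)))


def square_faulty(x: int) -> int:
    return x * x + 1  # injected design fault
```

With two diverse correct variants and one faulty variant, the voter masks the fault. If the same faulty module is used twice, the voter agrees on the wrong result, which is why plain replication does not help against software design faults.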
|
|
|
|
|
|
|
|
== Design decisions ==
|
|
|
|
|
|
|
|
The goal of the Reliability building block is to improve the reliability of the universAAL platform. The building block is therefore a vertical layer that cuts across all layers of universAAL, especially the Middleware. This is done by addressing two major challenges of reliability while enhancing system efficiency. The first action point is the creation of a framework that diagnoses the system behaviour by detecting faults that occur during system operation and takes decisions to overcome them. Taking the existing components of the Middleware into consideration, the following components are reused in the Diagnosis Framework: Context Events, the Context Bus and the Situation Reasoner [http://forge.universaal.org/wiki/uaal_context:Home| (see Context Group wikipages for more details)]. The Diagnosis Framework should not add to the operational load of the platform or interrupt other services. Because the Middleware uses message-based communication, the fault detection mechanisms also use message classification algorithms to categorize messages and to differentiate the message types exchanged in the platform. The Diagnosis Framework uses a knowledge base of rules that describe the behaviour of the system and define possible solutions. This knowledge base has to be fed continuously with new knowledge and cases so that it can decide in a growing number of use cases. A Fault Injection Framework has been implemented to create demanding test scenarios for a number of nodes in an AAL space; after such a test run, a file with the results can be fed into the knowledge base used by the Diagnosis Framework. In its final version, the Fault Injection Framework will be a bundle that is fully independent of the Middleware. This will also allow universAAL administrators to test the functionality of any uAAL space remotely.
The third bundle in the Reliability building block is the Time Triggered patch. This patch gives users of the universAAL platform the option of using time-triggered communication in their uAAL spaces, where the underlying infrastructure already takes many reliability aspects into account (e.g., global time synchronization, reliable communication of critical events in the system).
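The principle of fault injection on a message-based middleware can be sketched as a wrapper around a send function that drops messages and logs what it did. This is only a sketch under stated assumptions: the class <code>FaultInjector</code>, its parameters and the log format are hypothetical and do not describe the actual universAAL Fault Injection Framework.

```python
import random
from typing import Callable, Dict, List


class FaultInjector:
    """Wraps a send function and drops messages with a given probability (hypothetical)."""

    def __init__(self, send: Callable[[str], None], drop_probability: float, seed: int = 0):
        self._send = send
        self._drop_probability = drop_probability
        self._rng = random.Random(seed)  # seeded so that a test run is repeatable
        self.log: List[Dict[str, object]] = []

    def send(self, message: str) -> None:
        # Decide per message whether to inject an omission fault.
        dropped = self._rng.random() < self._drop_probability
        # The log plays the role of the feedback results collected after a test run.
        self.log.append({"message": message, "dropped": dropped})
        if not dropped:
            self._send(message)
```

The collected log could then be written to a results file and used as input for the diagnosis knowledge base.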
|
|
|
|
|
|
|
|
== Requirements ==
|
|
|
|
|
|
|
|
''High-level requirements''
|
|
|
|
|
|
|
|
* '''RC9_R1''' ''Dependability:'' The universAAL architecture shall support the delivery of services that can justifiably be trusted, where the service is the intended behavior of the system. The system must be resilient with respect to unanticipated behavior from the environment or of subsystems (e.g., transient and permanent hardware faults, design faults).
|
|
|
|
|
|
|
|
''Technical requirements''
|
|
|
|
|
|
|
|
* '''RC9_TR1''' ''Modular Certification of Subsystems:'' It must be possible to certify different subsystems individually.
|
|
|
|
* '''RC9_TR2''' ''Design for Testability:'' Testability shall be supported by the architecture (design testing, system-integration testing, manufacturing testing and assembly testing).
|
|
|
|
* '''RC9_TR3''' ''Correctness-by-Construction:'' Provably correct design methods shall be supported by the architecture with which a specification is transformed step by step into a correct design.
|
|
|
|
* '''RC9_TR4''' ''Delay/Disruption-tolerant Networking:'' Communication services that tolerate delays/disruption shall be provided by the architecture.
|
|
|
|
* '''RC9_TR5''' ''Communication Resource Guarantees:'' For messages that are exchanged within a certain subsystem, guarantees of the lower bound on the communication bandwidth, upper bounds on the latency and jitter shall be determinable.
|
|
|
|
* '''RC9_TR6''' ''Unreliable Components:'' The architecture must be capable of tolerating the failure of individual devices and interconnects.
|
|
|
|
* '''RC9_TR7''' ''Fault Hypothesis:'' Assumptions shall be identified that define the type and frequency of faults that the system has to be able to tolerate.
|
|
|
|
* '''RC9_TR8''' ''Error-Containment:'' The architecture must support the establishment of error containment regions, where errors can be detected with defined error-containment coverage.
|
|
|
|
* '''RC9_TR9''' ''Minimum of two Fault-Containment Regions:'' In case the occurrence of arbitrary (byzantine) failures within one fault containment region cannot be eliminated, an error containment region must be built of at least two fault containment regions.
|
|
|
|
* '''RC9_TR10''' ''Consistent Membership Service:'' A membership service shall exist within the architecture that consistently provides subsystems with the health state of other subsystems.
|
|
|
|
* '''RC9_TR11''' ''Generic Fault-tolerance Layer:'' A common API shall transparently mask fault-tolerance mechanisms of the environment to the application.
|
|
|
|
* '''RC9_TR12''' ''Tolerance of Software Errors:'' Protection mechanisms within the architecture shall be able to handle software errors.
|
|
|
|
* '''RC9_TR13''' ''Bounded Start-up and Restart Time:'' A known, bounded and minimal start-up time of system components has to be assured by the architecture.
|
|
|
|
* '''RC9_TR14''' ''Fault Classification:'' Error-detection mechanisms provided by the architecture have to distinguish between transient and permanent faults.
|
|
|
|
* '''RC9_TR15''' ''Pre-emptive Resource Allocation:'' The architecture must ensure that individual subsystems cannot dominate/block shared communication resources.
|
|
|
|
* '''RC9_TR16''' ''Worst Case Execution Time Analysis:'' The calculation of the worst-case execution time (WCET) of software modules with feasible effort shall be supported by the architecture.
|
|
|
|
* '''RC9_TR17''' ''Mixed-Criticality Subsystems:'' It shall be possible to use subsystems with different levels of criticality within one system.
|
|
|
|
* '''RC9_TR18''' ''Diagnostic Service:'' Identification of faulty subsystems for maintenance should be supported by the architecture. The diagnostic service needs a holistic view on the system, so that correlated failures and anomalies can be detected.
|
|
|
|
* '''RC9_TR19''' ''No Probe Effect:'' There must be no interference from the diagnostic service on the subsystems that are diagnosed.
|
|
|
|
* '''RC9_TR20''' ''Systematic Diagnostic Methods:'' The detection of application-independent failure modes (e.g., communication errors) should be supported by providing systematic diagnostic methods.
|
|
|
|
* '''RC9_TR21''' ''Application-specific Diagnostic Methods:'' Diagnostic services should be configurable to enable the detection of application-specific failures.
|
|
|
|
* '''RC9_TR22''' ''State Enforcement:'' It shall be possible to set the history state of a subsystem.
|
|
|
|
* '''RC9_TR23''' ''Different Levels of Reliability:'' The architecture shall provide different levels of reliability of the communication service.
|
|
|
|
* '''RC9_TR24''' ''Handling of Changing Reliability:'' Fault tolerance mechanisms shall be capable of adapting to changed reliability of subsystems over lifetime.
|
|
|
|
* '''RC9_TR25''' ''Replication:'' Replicas and voting mechanisms (e.g., triple-modular redundancy) shall be provided for error detection and error masking.
|
|
|
|
* '''RC9_TR26''' ''Replica Determinism:'' For replicated components, replica determinism has to be assured (i.e., replicated components are in the same state and produce the same output within a defined interval of time).
|
|
|
|
|
|
|
|
==Artefact #1 : Failure Diagnosis Module in universAAL==
|
|
|
|
|
|
|
|
|
|
|
|
==== Blackbox Description ====
|
|
|
|
|
|
|
|
Fault Diagnosis is the process of determining the type, size and location of the most probable fault, together with its temporal specification. Diagnosis is the reasoning process for the detection, isolation, analysis and recovery of occurring faults. A Symptom is the subjective evidence of a failure that indicates the existence of a fault.
|
|
|
|
|
|
|
|
The notion of a Fault Containment Region (FCR) is a key concept for reasoning about the behaviour of a system in the presence of faults. The knowledge about the immediate impact of a fault can serve as the starting point for the reliability analysis of a system. In addition, fault-tolerance mechanisms such as triple modular redundancy require replicas to be assigned to independent FCRs. An FCR is a set of subsystems that share one or more common resources that can be affected by a single fault; an FCR is assumed to fail independently of other FCRs.<ref>H. Kopetz. "Fault containment and error detection in the time-triggered architecture," The Sixth International Symposium on Autonomous Decentralized Systems (ISADS 2003), pp. 139-146, 9-11 April 2003. doi: 10.1109/ISADS.2003.1193942</ref>
|
|
|
|
|
|
|
|
The main components for the diagnosis infrastructure for universAAL are as follows.
|
|
|
|
#Error detection unit: the basic component that monitors the system's software components, processes and processing nodes, and reports errors.
|
|
|
|
#Fault analyser: the component that analyses the error reports gathered by the detection unit, based on diagnosis rules.
|
|
|
|
#Failure notifier: the component that gathers fault reports and publishes the diagnosis decisions.
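The interplay of the three components can be sketched as a small pipeline. The class names mirror the list above, but the interfaces, the rule format (symptom mapped to suspected fault) and the example symptom names are assumptions made for illustration; they are not the universAAL API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class ErrorReport:
    component: str  # monitored component, process or node
    symptom: str    # observed symptom, e.g. "no_heartbeat" (hypothetical name)


class ErrorDetectionUnit:
    """Collects error reports from monitored components."""

    def __init__(self) -> None:
        self.reports: List[ErrorReport] = []

    def report(self, component: str, symptom: str) -> None:
        self.reports.append(ErrorReport(component, symptom))


class FaultAnalyser:
    """Maps error reports to suspected faults using diagnosis rules."""

    def __init__(self, rules: Dict[str, str]) -> None:
        self._rules = rules  # symptom -> suspected fault

    def analyse(self, reports: List[ErrorReport]) -> List[Tuple[str, str]]:
        # Reports without a matching rule are ignored in this sketch.
        return [(r.component, self._rules[r.symptom])
                for r in reports if r.symptom in self._rules]


class FailureNotifier:
    """Publishes diagnosis decisions to registered listeners."""

    def __init__(self) -> None:
        self._listeners: List[Callable[[str, str], None]] = []

    def subscribe(self, listener: Callable[[str, str], None]) -> None:
        self._listeners.append(listener)

    def publish(self, decisions: List[Tuple[str, str]]) -> None:
        for component, fault in decisions:
            for listener in self._listeners:
                listener(component, fault)
```

A typical flow would be: components call <code>report</code>, the analyser is run over the collected reports, and the notifier publishes the resulting decisions.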
|
|
|
|
|
|
|
|
==== Bundles ====
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" colspan="2" | Artifact: '' Failure Diagnosis Module in universAAL ''
|
|
|
|
|-
|
|
|
|
| GIT Address
|
|
|
|
| http://forge.universaal.org/svn/uaal_context/trunk/ctxt.reliability.reasoner
|
|
|
|
|-
|
|
|
|
| Javadoc
|
|
|
|
| http://depot.universaal.org/hudson/job/context/javadoc/
|
|
|
|
|-
|
|
|
|
| Design Diagrams
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
| Reference Documentation
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
|}
|
|
|
|
|
|
|
|
==== Requirements ====
|
|
|
|
* '''RC9_TR1''' ''Modular Certification of Subsystems''
|
|
|
|
* '''RC9_TR6''' ''Unreliable Components''
|
|
|
|
* '''RC9_TR7''' ''Fault Hypothesis''
|
|
|
|
* '''RC9_TR8''' ''Error-Containment''
|
|
|
|
* '''RC9_TR9''' ''Minimum of two Fault-Containment Regions''
|
|
|
|
* '''RC9_TR12''' ''Tolerance of Software Errors''
|
|
|
|
* '''RC9_TR14''' ''Fault Classification''
|
|
|
|
* '''RC9_TR17''' ''Mixed-Criticality Subsystems''
|
|
|
|
* '''RC9_TR18''' ''Diagnostic Service''
|
|
|
|
* '''RC9_TR19''' ''No Probe Effect''
|
|
|
|
* '''RC9_TR20''' ''Systematic Diagnostic Methods''
|
|
|
|
* '''RC9_TR21''' ''Application-specific Diagnostic Methods''
|
|
|
|
* '''RC9_TR22''' ''State Enforcement''
|
|
|
|
* '''RC9_TR23''' ''Different Levels of Reliability''
|
|
|
|
* '''RC9_TR24''' ''Handling of Changing Reliability''
|
|
|
|
|
|
|
|
|
|
|
|
=== Features ===
|
|
|
|
This artefact offers the following features.
|
|
|
|
|
|
|
|
#Fault hypotheses: on top of which the rules for diagnosis are designed
|
|
|
|
#Fault Containment Regions: to identify the FCRs in universAAL
|
|
|
|
#Failure modes: to identify the specific failure modes of each FCR
|
|
|
|
#Failure diagnosis rules: to reason about the root cause of any failure
|
|
|
|
#Reliability reasoner: the reasoning engine to make decisions on detected error events
|
|
|
|
|
|
|
|
=== Design Decisions ===
|
|
|
|
|
|
|
|
==== Fault Containment Region in universAAL ====
|
|
|
|
|
|
|
|
As diagnosis involves backtracking from failure to fault, the knowledge about the possible FCRs in the universAAL platform is used in the fault analysis process. This fault analysis also uses knowledge about failures in both the time and the value domain. To gather this knowledge, the whole platform is divided into several Fault Containment Regions (FCRs), together with the specific failure modes that they can show. In the following, these FCRs are listed with their possible failure modes and use cases.
|
|
|
|
|
|
|
|
From the diagnosis point of view, the universAAL platform can be modelled as a set of FCRs that together cover the whole platform.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/rel1.jpg|500px|center|Fault Containment Regions in MW]]
|
|
|
|
|
|
|
|
In the following, a comprehensive list of Fault Containment Regions with their respective failure modes is given. Each FCR is listed with its input, output and rationale, so that the inclusion of this FCR is justified.
|
|
|
|
The failure modes for each of the components are classified as follows.
|
|
|
|
From the consistency point of view
|
|
|
|
#Consistent Failure
|
|
|
|
#Inconsistent Failure
|
|
|
|
From the time and value perspective
|
|
|
|
#Timing failure
|
|
|
|
##Early Timing Failure: e.g., babbling idiot
|
|
|
|
##Late Timing Failure: e.g., omission, crash, fail-stop
|
|
|
|
#Value failure
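The time and value classification above can be written down as a small data structure. The following is a sketch only: the enum, the function <code>classify</code> and its parameters (a permitted arrival window and a value check result) are assumptions for illustration, and the consistency dimension is not modelled.

```python
from enum import Enum
from typing import Optional, Set


class FailureMode(Enum):
    EARLY_TIMING = "early timing failure"  # e.g. babbling idiot
    LATE_TIMING = "late timing failure"    # e.g. omission, crash, fail-stop
    VALUE = "value failure"


def classify(arrival: Optional[float], earliest: float, latest: float,
             value_ok: bool = True) -> Set[FailureMode]:
    """Classify one expected message against its permitted time window.

    `arrival` is None when the message never arrived during the observation period.
    """
    modes: Set[FailureMode] = set()
    if arrival is None or arrival > latest:
        modes.add(FailureMode.LATE_TIMING)
    elif arrival < earliest:
        modes.add(FailureMode.EARLY_TIMING)
    # A value can only be judged if a message actually arrived.
    if arrival is not None and not value_ok:
        modes.add(FailureMode.VALUE)
    return modes
```

An empty result means that the message was correct in both the time and the value domain.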
|
|
|
|
|
|
|
|
==== Use case and failure modes of the Fault Containment Regions ====
|
|
|
|
|
|
|
|
===== Hardware Faults =====
|
|
|
|
The hardware faults covered by the following use cases of the fault containment regions are:
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Node
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A single node shares processor, memory and power supply, so a single physical fault will affect the software components running on it
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|•Omission failure: a node can stop sending or receiving signals (messages) to or from the physical channels. This failure mode appears as a late timing failure mode from our [1st level failure modes].
|
|
|
|
|
|
|
|
•Babbling idiot: a node can send untimely messages to the channel. This failure mode is an early timing failure mode
|
|
|
|
|}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Communication channel
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A physical fault of the communication channel will lead to a communication breakdown between the nodes that are connected through the channel. Value failures are not considered here, as error-correcting codes are able to deal with them. It is also assumed that the channel cannot create messages by itself.
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|Crash failure: a physical communication channel may not produce any output and this failure can remain undetected by the correct FCRs
|
|
|
|
|}
|
|
|
|
|
|
|
|
(*)Solely dependent on the application and/or specific implementation.
|
|
|
|
|
|
|
|
===== Software Faults =====
|
|
|
|
The software faults covered by the following use cases of the fault containment regions are:
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Operating System (OS)
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|An OS failure renders an entire node unusable because applications depend on core OS services for memory allocation, process management and I/O.
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|The failure modes for OS require a deeper understanding of the structure of the OS and are kept for future development.
|
|
|
|
|}
|
|
|
|
(*)OS is out of scope for this diagnosis implementation
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Middleware
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|As the middleware acts as a broker between the hardware and the application software, a failure of the middleware will lead to total system failure
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
|•Input from AAL Space managers
|
|
|
|
•Input from AAL Space applications
|
|
|
|
|
|
|
|
•Packets from Ethernet (Input from layers below)
|
|
|
|
|
|
|
|
•Packages and messages from components (Input from layers above)
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
|•Context event for context manager
|
|
|
|
•Service invoke for the service manager
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| Provides two types of connection points
|
|
|
|
•Connection points between instances of middleware
|
|
|
|
•Connection points between the local components of the node and the system
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|•Omission failure: a piece of middleware can fail to send or receive messages or packages to or from the layers above and below. This includes a scenario in which one piece of middleware does not deliver a required package to another piece of middleware. This failure mode is already handled by the current UPnP connectors
|
|
|
|
•Crash failure: a piece of middleware can omit its output for all subsequent input until it is restarted. For example, a new UPnP device joins the middleware bus. If the middleware fails to publish the context event announcing the new device on the context bus and has to be restarted before it produces output again, the scenario is a crash failure.
|
|
|
|
|
|
|
|
•Byzantine failure: this failure mode covers all the arbitrary failures that may occur to the middleware
|
|
|
|
|}
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | I/O Drivers
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A failure of an I/O driver prevents the corresponding I/O device from working, because the driver controls the device and maintains data acquisition and visualization.
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| •Provides an interface for hardware like printers, video adapters, sound cards, network cards, digital cameras etc.
|
|
|
|
|
|
|
|
•For hardware:
|
|
|
|
a.Interfacing directly
|
|
|
|
b.Writing to or reading from a device control register
|
|
|
|
c.Using some higher-level interface (e.g. Video BIOS)
|
|
|
|
d.Using another lower-level device driver (e.g. file system drivers using disk drivers)
|
|
|
|
|
|
|
|
•For software:
|
|
|
|
a.Allowing the operating system direct access to hardware resources
|
|
|
|
b.Implementing only primitives
|
|
|
|
c.Implementing an interface for non-driver software
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|The failure modes of this FCR are tied to those of the OS; whether they fall within the scope of this work still has to be checked
|
|
|
|
|}
|
|
|
|
(*)Solely dependent on the application and/or specific implementation.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Application components
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A failure of one application component is contained within that component
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| Application dependent
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
| *
|
|
|
|
|}
|
|
|
|
(*)Solely dependent on the application and/or specific implementation.
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | AAL Space Gateway
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A failure of an AAL Space Gateway breaks the intra- and inter-AAL Space bridging mechanism; the failure of one gateway is contained within that gateway
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
|•Incoming request from a remote user (including another AAL Space) to start and/or publish its service and/or service request
|
|
|
|
•Message from output bus
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
|•Authenticated or denied request to the remote user (including another AAL Space) and/or service provider
|
|
|
|
•A communication channel between the remote user (AAL Space) and the current AAL Space
|
|
|
|
|
|
|
|
•Message to input bus
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
|•Manages the activities among the Space Federation (inter-AAL Space) or inside the same space (intra-AAL Space)
|
|
|
|
•Acts as an IO handler within an AAL Space
|
|
|
|
|
|
|
|
•Provides certain communication service to other IO handlers
|
|
|
|
•Provides a mechanism to check the trustworthiness (authentication/deny request)
|
|
|
|
|
|
|
|
•Enable intra-space communications
|
|
|
|
|
|
|
|
•Log the AAL Space activities.
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|•Omission failure: for example, a remote device tries to connect to the AAL Space through the AAL Space Gateway, but the gateway does not respond to the incoming join request. In another scenario, the gateway omits messages that it received from the AAL Space output bus and should have passed on to the remote user (including another AAL Space)
|
|
|
|
•Babbling idiot: for example, the faulty AAL Space Gateway constantly sends high-priority messages to the AAL Space (specifically, to the input bus)
|
|
|
|
|
|
|
|
•Value failure: an example scenario for this failure is as follows. A remote trustworthy user (including another AAL Space) tries to join the current AAL Space using the AAL Space Gateway, but the gateway is denying the remote connection
|
|
|
|
|
|
|
|
|}
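A babbling idiot such as the faulty gateway described above can, in principle, be detected by checking a sender against a message budget per time window. The sketch below is an assumption made for illustration: the class <code>RateGuard</code>, its parameters and the budget values are hypothetical and are not taken from the universAAL implementation.

```python
from collections import deque
from typing import Deque


class RateGuard:
    """Flags a sender that exceeds a message budget within a sliding time window."""

    def __init__(self, max_messages: int, window_s: float) -> None:
        self.max_messages = max_messages
        self.window_s = window_s
        self._stamps: Deque[float] = deque()

    def observe(self, t: float) -> bool:
        """Record a message at time t (seconds); return True if the budget is exceeded."""
        self._stamps.append(t)
        # Forget messages that have left the sliding window.
        while self._stamps and t - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        return len(self._stamps) > self.max_messages
```

A positive result would be reported as a symptom of an early timing failure for the FCR that sent the messages.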
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Connectors (of ACL)
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|Protocol specific
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
| *
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| Application dependent
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
| *
|
|
|
|
|}
|
|
|
|
(*)Solely dependent on the application and/or specific implementation.
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | ACL (Abstract Connection Layer)
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A failure of the ACL leads to a breakdown of connectivity among the instances of the middleware
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
|Registration message from SodaPopPeer
|
|
|
|
Listener request from PeerDiscoveryListener
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
|Registration message of the P2PConnector to the underlying hosting OSGi framework
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| •Peer-discovery
|
|
|
|
•Creating proxies of remote implementations of SodaPopPeer
|
|
|
|
|
|
|
|
•Forwarding calls made to SodaPopPeer proxies to the real implementations of it on the side of remote peers.
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|•Fail stop failure
|
|
|
|
•Value failure: the ACL maintains a queue for incoming requests for the registration message from the SodaPopPeer. If the ACL fails to produce the correct registration message for the underlying OSGi framework, a value failure occurs.
|
|
|
|
|
|
|
|
•Babbling idiot: this failure mode includes a scenario in which the ACL produces a correct registration message for the P2PConnector, but produces it very late
|
|
|
|
|}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | SODAPOP layer
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A failure of the SODAPOP layer disconnects the ACL from the AAL specific layer
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
|Incoming calls from ACL to bind a peer
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
|Communication between the peers using the buses
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| •Finds peers by PeerDiscoveryListener interface
|
|
|
|
•Peers access middleware by SodaPopPeer interface
|
|
|
|
|
|
|
|
•Buses communicate with own peers by SodaPop interface
|
|
|
|
|
|
|
|
•Serialize and deserialize messages by MessageContentSerializer interface
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|•Timing failure: untimely (de-)serialization of messages; establishing communication between peers when one of them is already absent
|
|
|
|
•Omission failure: the sending (receiving) SODAPOP layer fails to send (receive) the message
|
|
|
|
|
|
|
|
•Value message failure: the message contents do not comply with the interface specification
|
|
|
|
|}
|
|
|
|
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | FCR
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Virtual Communication Bus (Context, Service, UI)
|
|
|
|
|-
|
|
|
|
| Rationale
|
|
|
|
|A failure of the logical bus stops the exchange of communication messages, as the buses are the Connection Points towards the AAL Specific layer
|
|
|
|
|-
|
|
|
|
| Input
|
|
|
|
|(De-)registration messages
|
|
|
|
|-
|
|
|
|
| Output
|
|
|
|
|Messages defined by BusStrategy abstract class
|
|
|
|
|-
|
|
|
|
| Service
|
|
|
|
| Management of the message queue
|
|
|
|
Propagate messages
|
|
|
|
|-
|
|
|
|
| Failure Modes
|
|
|
|
|•Value message failure: transmitted messages do not comply with the interface specification
|
|
|
|
•Timing message failure: a message is sent at an instant that is not specified in the time domain
|
|
|
|
|}
|
|
|
|
|
|
|
|
=== Implementation of the Diagnosis Framework ===
|
|
|
|
==== Initial implementation from selected input projects ====
|
|
|
|
There was no initial implementation from the input projects.
|
|
|
|
|
|
|
|
==== Implementation Plan ====
|
|
|
|
This section presents an integrated detection and diagnosis framework that can identify anomalies and find the most probable root cause not only of severe problems but also of smaller degradations. Anomaly detection is based on monitoring the profile of uAAL components (see the following section on the Error Detection Unit). Diagnosis is based on reports of previous fault cases: their characteristic impact on different performance indicators is identified and learned.
|
|
|
|
|
|
|
|
In everyday terminology, detection and diagnosis are hardly separated. The phrase “detecting a problem” often means two things: first, confirming that there is a problem at all and second, verifying the nature or type of the problem. Consider the following example. Sensors that register weight in the bed can activate the lighting of the route to the toilet when the bed is left. In one instance, however, the lighting does not activate although the bed is left, because the weight sensor of the bed generates no signal. The correct terminology in this example is to say that an unusual behavior has been detected (i.e., the lights are not activated) and that the cause has been diagnosed as, e.g., a damaged sensor that has to be replaced. Detecting that the lights do not activate does not necessarily mean that there is a problem with the lighting itself; looking only at the symptom level with this granularity, it is impossible to tell whether there is a serious problem with the lights or whether the master switch of the lights just has to be restarted. Therefore, if an unusual behavior is detected, a more thorough diagnosis has to be conducted to find out whether there actually is a problem and what its root cause is. Since the terms “detection” and “diagnosis” often carry an implicit duality, they have to be precisely defined before they are used in an engineering system such as an Ambient Assisted Living space. Detection means identifying something unusual in the network. In the context of the integrated framework in uAAL, however, the role of the detection process is only to provide the diagnosis with a common view of possible indicators (symptoms) to facilitate their correlation; deciding whether there is a fault at all, and what it is, is left to the diagnosis. Diagnosis means investigating the root cause that could have caused the detected symptoms.
In the framework, the input of the diagnosis is the output of the detection unit. The output of the diagnosis may well be that there is in fact no problem at all. Usually, once the root cause has been diagnosed, certain corrective actions have to be performed to resolve the problem. Sometimes investigating the root cause is harder than providing the action without knowing the underlying mechanisms; e.g., several failures can share a common corrective action (like restarting the sensor) while the root cause remains unknown to the maintenance operator. It is even possible that the associated action is not a direct correction of the fault but a recommended escalation (e.g., alerting the manual support line). Therefore, returning the corrective action instead of the specific root cause is also acceptable. The root cause or the corrective action is what the diagnosis returns; the two are jointly referred to as the target of the diagnosis.
|
|
|
|
|
|
|
|
The integrated diagnosis framework builds on the Context bus of universAAL: any context event may indicate a symptom of a fault. It also uses SPARQL for reasoning and the publish/subscribe model of universAAL. The integrated diagnosis framework is depicted in the following figure.
|
|
|
|
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/DiagnosisFramework.png|600px|center]]
|
|
|
|
|
|
|
|
Context events related to faults are taken from the context bus as symptoms of a failure. These symptoms are analyzed using a priori knowledge of the FCR and the related static knowledge of the associated failure mode. The Reliability Reasoner then queries the symptoms with the help of the KB (Knowledge Base) and the [http://forge.universaal.org/wiki/ontologies:Dependability# Dependability Ontology]. The symptoms can be analyzed either with a rule-based approach or with a simple SPARQL query. The rules for the failure analysis reside inside the Reliability Reasoner. Finally, the reasoner publishes a context event containing the diagnosis information on the context bus. This diagnosis information includes the actions to be taken for the specific failure modes of the specific FCR.
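The rule-based mapping from symptoms to a diagnosis target can be illustrated with a small lookup table. This is a minimal sketch only: the symptom names, FCR identifiers and actions are hypothetical, and the code does not reflect the actual rules of the Reliability Reasoner or the content of the Dependability Ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptom:
    fcr: str        # FCR that the symptom refers to (hypothetical identifier)
    indicator: str  # kind of anomaly reported (hypothetical name)


# Hypothetical rule base: (FCR, indicator) -> diagnosis target.
# The target may be a root cause, a corrective action or an escalation.
RULES = {
    ("bed_weight_sensor", "omission"): "replace sensor",
    ("bed_weight_sensor", "untimely"): "restart sensor",
    ("light_switch", "omission"): "restart master switch",
}


def diagnose(symptoms):
    """Return the distinct diagnosis targets for the given symptoms.

    Symptoms without a matching rule are escalated to manual support.
    An empty symptom list yields an empty result (no problem found).
    """
    targets = []
    for s in symptoms:
        target = RULES.get((s.fcr, s.indicator), "escalate to manual support")
        if target not in targets:
            targets.append(target)
    return targets
```

In this sketch the diagnosis returns a corrective action rather than a root cause, which the text above describes as an acceptable target.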
|
|
|
|
|
|
|
|
==Artefact #2 : Error Detection Unit ==
|
|
|
|
|
|
|
|
=== Blackbox Description ===
|
|
|
|
In a highly distributed system such as an AAL environment, where a large number of hardware and software components contribute to a given scenario, the probability of fault occurrence is significant. Some of the provided services are critical and must be delivered with relatively high reliability and availability, i.e., the cooperating components should provide at least a degraded level of service even in the presence of faults. To tolerate faults in such systems, three interrelated phases should be followed:
|
|
|
|
#Fault detection.
|
|
|
|
#Fault diagnosis.
|
|
|
|
#Fault masking and recovery.
|
|
|
|
The first phase is responsible for detecting anomalies within the system. It requires prior knowledge about the correct system behavior, so that any deviation from the normal behavior, in either the time or the value domain, can be caught by the error detection mechanism. The detected anomalies are then analyzed and diagnosed by a diagnosis technique. After the faults have been diagnosed, an action should be taken, either by recovering from the faults (online or offline) or by blocking the faults and preventing them from propagating to other healthy units.
|
|
|
|
Because of its importance for fault-tolerant operation, an error detection framework has been created. The framework is based on classifying the messages exchanged within the network of the distributed system according to their specifications in time and value.
|
|
|
|
|
|
|
|
=== Bundles ===
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" colspan="2" | Artifact: ''Error Detection Unit''
|
|
|
|
|-
|
|
|
|
| GIT Address
|
|
|
|
| http://forge.universaal.org/svn/uaal_context/trunk/ctxt.error.detection.unit
|
|
|
|
|-
|
|
|
|
| Javadoc
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
| Design Diagrams
|
|
|
|
| [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Physical_distribution_of_EDU.png Physical Distribution of EDU], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Conceptual_model_of_EDU.png Conceptual Model of EDU], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Data_structure_in_EDU.png Data Structure in EDU], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Event_list_calendar.png Event List Calendar]
|
|
|
|
|-
|
|
|
|
| Reference Documentation
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
|}
|
|
|
|
|
|
|
|
=== Requirements ===
|
|
|
|
|
|
|
|
* '''RC9_TR1''' ''Modular Certification of Subsystems.''
|
|
|
|
* '''RC9_TR11''' ''Generic Fault-tolerance Layer.''
|
|
|
|
* '''RC9_TR12''' ''Tolerance of Software Errors.''
|
|
|
|
* '''RC9_TR13''' ''Bounded Start-up and Restart Time.''
|
|
|
|
* '''RC9_TR14''' ''Fault Classification.''
|
|
|
|
* '''RC9_TR21''' ''Application-specific Diagnostic Methods.''
|
|
|
|
* '''RC9_TR22''' ''State Enforcement.''
|
|
|
|
|
|
|
|
=== Features ===
|
|
|
|
The EDU enhances the reliability of the universAAL platform by discovering faults of the exchanged messages in different domains. The discovered faults can then be forwarded to the diagnostic unit, which takes a suitable action. Several fault detection methods have been implemented in order to cover a wide range of faults. These methods may be classified as follows:
|
|
|
|
*Fault detection in the time domain for both periodic and sporadic messages. These methods can detect temporary and permanent faults in the time domain.
|
|
|
|
*Fault detection in the semantic domain, for which several check processes have been implemented (range check, 1st derivative). These methods can detect temporary faults and design faults in the semantic domain.
|
|
|
|
In addition to the general-purpose fault detection methods, the EDU has been built in an extensible way so that it accepts any other application-specific check process.
|
|
|
|
|
|
|
|
=== Design Decisions ===
|
|
|
|
==== Message classification concept ====
|
|
|
|
The principle of error detection by message classification was first introduced by (Jones, Kopetz) in the Dependable System of Systems (DSoS) conceptual model. The DSoS conceptual model classifies messages as shown in the next table. Before a message is sent from one node to another within the same network, protective code bits (e.g., CRC bits) should be added to the message; the output assertion on the sending node then verifies the message. The message is classified as checked if it passes the output assertion. To be permitted, the message has to pass the input assertion of the destination node. A syntactic check is performed on the message by checking the code bits, to make sure that the message is still valid and has not been truncated. After that, the message is checked against its receiving time and its semantics to see whether it is timely and correct and can therefore be used further; otherwise the message is not used.
|
|
|
|
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Attribute
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Explanation
|
|
|
|
! align="left" bgcolor="#DDDDDD" | Antonym
|
|
|
|
|-
|
|
|
|
| valid
|
|
|
|
| A message is valid if it contains a correct CRC.
|
|
|
|
| invalid
|
|
|
|
|-
|
|
|
|
| checked
|
|
|
|
| A message is checked if it passes the output assertion.
|
|
|
|
| not checked
|
|
|
|
|-
|
|
|
|
| permitted
|
|
|
|
| A message is permitted with respect to a receiver if it passes the input assertion of that receiver.
|
|
|
|
| not permitted
|
|
|
|
|-
|
|
|
|
| timely
|
|
|
|
| A message is timely if it is in agreement with the temporal specification.
|
|
|
|
| untimely
|
|
|
|
|-
|
|
|
|
| correct
|
|
|
|
| A message is correct if it is in agreement with the temporal and the value specification.
|
|
|
|
| incorrect
|
|
|
|
|-
|
|
|
|
| insidious
|
|
|
|
| A message is insidious if it is permitted but incorrect.
|
|
|
|
| not insidious
|
|
|
|
|-
|
|
|
|
|}
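The attributes of the table can be derived mechanically from the results of the individual checks. The following is a minimal sketch under the assumption that each check yields a boolean result; the field names are illustrative and are not taken from the EDU sources.

```python
from dataclasses import dataclass


@dataclass
class CheckResults:
    crc_ok: bool         # syntactic check: the message contains a correct CRC
    passed_output: bool  # output assertion of the sender
    passed_input: bool   # input assertion of the receiver
    meets_timing: bool   # agreement with the temporal specification
    meets_value: bool    # agreement with the value specification


def classify(r):
    """Map the check results to the attributes of the DSoS table."""
    correct = r.meets_timing and r.meets_value
    return {
        "valid": r.crc_ok,
        "checked": r.passed_output,
        "permitted": r.passed_input,
        "timely": r.meets_timing,
        "correct": correct,
        # Insidious: accepted by the receiver although it is incorrect.
        "insidious": r.passed_input and not correct,
    }
```

A message with a correct CRC that passes both assertions and arrives on time but carries a value outside its specification is classified as permitted, timely, incorrect and therefore insidious.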
|
|
|
|
|
|
|
|
==== Conceptual model of Error detection unit ====
|
|
|
|
The next figure depicts a simple network that consists of several universAAL-aware communication nodes. The EDU has been realized in each universAAL node as a separate software component located between the middleware and the application layer. The EDU is not application-specific, but it uses some functions of the underlying operating system to ensure predictable behavior. However, the EDU has to be configured by the application developer to meet the specification of the application.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Physical_distribution_of_EDU.png|600px|center]]
|
|
|
|
|
|
|
|
The EDU has been designed to handle only events received by uAAL components. Whenever a uAAL component receives a new event from the context bus, it can deliver this event to the EDU, which checks it against several fault types that are predefined at design time by the uAAL component itself. The physical location of the EDU on the receiving node helps the EDU monitor the status of the sender by analyzing its messages. Two design options were available: placing the EDU on the sending side or on the receiving side. In some situations it is difficult for the sending node to judge itself. Suppose, for instance, that the sending node has lost synchronization with the rest of the system because of a drift of its oscillator; in this case, it would be unreasonable to trust the node’s own decision on whether the message timing is correct.
|
|
|
|
As mentioned in the previous sections, the EDU relies on the message classification concept to detect anomalies in the received messages. The next figure shows the flow of a received message inside the EDU and how the message classification concept has been realized in it.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Conceptual_model_of_EDU.png|500px|center]]
|
|
|
|
|
|
|
|
First of all, the incoming message has to pass the syntactic check, which determines whether the received message is valid. In practice, the syntactic check tests whether the received message has been configured by the user. If not, the message is dropped and does not proceed to the other checks; at the same time, an indication is sent to the diagnostic unit to report the invalid message. If the message is valid, a time check verifies the timing of the message. Depending on the timing behavior of the message (e.g., periodic or sporadic), different time check algorithms may be required. Timely messages finally have to pass the semantic check, which makes sure that the received message is error-free. Different software methods are available for checking the message semantics. Some of these methods are not application-specific and can be applied generally, such as the limit check and the 1st derivative check, while other methods, such as the plausibility check and the process-model-based check, require more information about the application.
|
|
|
|
If the message is dropped at any of these check points, an indication is sent to the diagnostic unit so that it can take a suitable decision. To make an accurate decision possible, the error detection unit has to provide accurate information about the detected anomaly. This information should contain the error type, location and time.
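The order of the three checks and the content of the indication can be sketched as follows. This is an illustrative sketch, not the EDU implementation (which is written in C): the message layout, the configuration format and the field names of the indication are assumptions.

```python
def check_message(msg, config, node):
    """Run the syntactic, time and semantic checks in this order.

    Returns None if the message passes all checks; otherwise returns an
    indication for the diagnostic unit with error type, location and time.
    """
    def indication(error_type):
        return {
            "type": error_type,
            "location": node,
            "time": msg["arrival"],
            "msg_id": msg["id"],
        }

    spec = config.get(msg["id"])
    if spec is None:
        # Syntactic check: the message has not been configured by the user.
        return indication("invalid")
    if not spec["time_check"](msg["arrival"]):
        return indication("untimely")
    for check in spec["semantic_checks"]:
        if not check(msg["value"]):
            return indication("incorrect")
    return None
```

A message is dropped at the first failing check point, so an unconfigured message is never evaluated against time or value specifications.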
|
|
|
|
The fault messages generated by the EDU on each node are finally published on the context bus, see Figure 1. The diagnostic unit should be able to subscribe to all of these events from the different nodes. Physically, the diagnostic unit should reside on one central node, and this node should be able to connect to all distributed nodes over a suitable network.
|
|
|
|
In order to realize the EDU, several assumptions and requirements have to be taken into consideration before entering the implementation phase:
|
|
|
|
*Deterministic behavior: since the check functions in the EDU rely on prior knowledge in both the time and the value domain, deterministic behavior of both the middleware and the communication infrastructure is required to ensure message consistency.
|
|
|
|
*Synchronization among the communication nodes: in order to have a common view of time, the senders and the receivers of the messages should be synchronized.
|
|
|
|
*Syntactic check within the communication architecture: the framework assumes that no undetected fault can affect the message content in the communication network, i.e., the communication network is able to catch syntactic faults (e.g., flipped bits, truncated messages) by implementing some type of check function (e.g., a CRC check).
|
|
|
|
*Extensibility: the framework should be extensible so that it can accommodate any new error detection mechanism.
|
|
|
|
|
|
|
|
=== Implementation ===
|
|
|
|
==== Initial implementation from selected input projects ====
|
|
|
|
There was no initial implementation from the input projects.
|
|
|
|
|
|
|
|
==== Implementation Plan ====
|
|
|
|
Before describing the practical elements of the EDU and how they have been implemented, several design aspects should be clarified. One of the most important features of the EDU is its ability to detect errors in the time domain. The error detection mechanism in the time domain has to be accurate enough to handle timing errors at high resolution; thus, high-resolution time stamping and timer mechanisms are needed. To this end, specific timing functions of the OS (Linux in our case) have been used. Although this makes the EDU code non-portable, equivalent timing functions of other operating systems can be found to replace those of Linux.
|
|
|
|
For this reason, and to make the code more flexible, the core of the EDU has been implemented in C and provided with native methods that form the interface to the middleware, which is implemented in Java.
|
|
|
|
To perform the message classification inside the EDU, prior knowledge about the message specifications in both the time and the value domain is required. These specifications have to be provided by the application developer at design time, before the EDU is used. An XML configuration file has been created to make it easier for developers to specify their messages. A parser function reads the information from the XML file and passes it to the main data structure of the EDU.
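The role of the parser can be sketched as follows. The XML layout shown here is purely hypothetical (the actual schema of the EDU configuration file is not given in this document), and the sketch uses Python's standard library rather than the C parser of the EDU.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: element and attribute names are illustrative only.
EXAMPLE_CONFIG = """
<edu>
  <message id="101">
    <periodic period_ms="100" phase_ms="0"/>
    <limit min="0" max="50"/>
    <trend min="-5" max="5"/>
  </message>
</edu>
"""


def parse_config(xml_text):
    """Return {message_id: [(check_name, {parameter: value}), ...]}.

    The checks of a message are kept in the order in which they appear
    in the file.
    """
    table = {}
    for message in ET.fromstring(xml_text).findall("message"):
        checks = [
            (check.tag, {name: float(value) for name, value in check.attrib.items()})
            for check in message
        ]
        table[int(message.get("id"))] = checks
    return table
```

The resulting dictionary plays the role of the hash table that maps each message ID to its list of check processes.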
|
|
|
|
The data structure inside the EDU consists mainly of a hash table that uses the message ID as key and a list of check process structures as the associated value (see the next figure). Each message has to pass the check points that were associated with it during the design phase. Consider a message with the message ID “101”, as in the next figure. Message 101 is configured as a periodic message and carries an integer value that is tested against thresholds by the limit check and the 1st derivative check; therefore three check processes are applied to this message. Looking up the message ID in the hash table returns a pointer to the head of the check process list; in our case, the pointer refers to the periodic field. The periodic information of message 101, such as the period and the phase, is found in its structure instance (Periodic struct.). A periodic check function is then called to compare the stored time information with the time information extracted from the received message. If message 101 meets its time specification, it is considered timely; otherwise an untimely-message indication is given to the diagnostic unit.
|
|
|
|
When the periodic check function terminates, the pointer advances to the second check process, and so on until the end of the check process list is reached.
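The walk through the check process list of message 101 can be sketched with a dictionary in place of the C hash table. The thresholds, the tolerance and the message fields below are hypothetical values chosen for illustration.

```python
# Hash table: message ID -> ordered list of (check name, check function).
# All thresholds are hypothetical.
check_table = {
    101: [
        ("periodic", lambda msg: abs(msg["arrival"] - msg["expected"]) <= 5),
        ("limit", lambda msg: 0 <= msg["value"] <= 50),
        ("trend", lambda msg: abs(msg["value"] - msg["previous"]) <= 10),
    ],
}


def failed_checks(msg_id, msg):
    """Walk the check list of a message and return the names of failed checks."""
    return [name for name, check in check_table.get(msg_id, []) if not check(msg)]
```

Unlike the pipeline of the EDU, this sketch evaluates every check in the list and reports all failures, which makes the traversal order easy to see.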
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Data_structure_in_EDU.png|500px|center]]
|
|
|
|
|
|
|
|
===== Fault detection mechanism in time domain =====
|
|
|
|
To cover the fault hypothesis in the time domain, two types of messages are distinguished according to their timing behavior:
|
|
|
|
*'''Periodic messages:''' this type of message has a fixed period after which a new message of the same type should be transmitted over the network. A periodic message may be transmitted more than once within one cluster cycle; therefore periodic messages with the same period but different phases are distinguished.
|
|
|
|
*'''Sporadic messages:''' in contrast to periodic messages, sporadic messages have to be re-generated within a certain time range. In other words, a sporadic message has a minimum and a maximum inter-arrival time and should be re-transmitted within this range.
|
|
|
|
|
|
|
|
For both types, a message is timely when it meets its specification, and untimely if one of the following situations occurs:
|
|
|
|
*Message came early.
|
|
|
|
*Message came late.
|
|
|
|
*Message didn’t come.
|
|
|
|
The first two errors can be detected by an event-triggered mechanism that compares the receive instant of the message with the known message specification. If a message is lost, no event is generated; therefore another mechanism, based on the time-triggered paradigm, is used. The time-triggered mechanism sets a timeout for the awaited event, i.e., a deadline for the receive instant. If the deadline passes without a message being received, an indication is sent to the diagnosis unit. This type of error detection mechanism helps detect omission failures, which may be caused by a permanent fault.
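The event-triggered part of the time check (early or late arrival) can be sketched as two small functions. The tolerance parameter of the periodic check is an assumption introduced for illustration; times are plain integers (e.g., milliseconds).

```python
def check_periodic(arrival, expected, tolerance):
    """Compare the receive instant with the expected instant of a periodic message."""
    if arrival < expected - tolerance:
        return "early"
    if arrival > expected + tolerance:
        return "late"
    return "timely"


def check_sporadic(arrival, previous_arrival, min_gap, max_gap):
    """Check the inter-arrival time of a sporadic message against its range."""
    gap = arrival - previous_arrival
    if gap < min_gap:
        return "early"
    if gap > max_gap:
        return "late"
    return "timely"
```

Neither function can detect a message that never arrives, because it is only called on reception; this is the case that requires the timeout mechanism.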
|
|
|
|
In order to schedule an event with a timeout in the future, an event list data structure has been created. The event list, also called the event calendar, handles the events of the different messages and keeps all events in time order, so that the next event can be determined readily once the current one has completed execution. During the execution of an event, new events may be scheduled.
|
|
|
|
A single event handler is used to handle the events of all incoming messages. This avoids conflicts between periodic and sporadic events that are scheduled for the same point in time. In addition, it would not be reasonable to create an event handler for each message.
|
|
|
|
The event calendar consists mainly of the message ID and the expected next arrival time of the related message, as shown in Figure 4. The expected next arrival time of a periodic message may be computed as
|
|
|
|
Time schedule periodic = current time + message period
|
|
|
|
For a sporadic message:
|
|
|
|
Time schedule sporadic = previous arrival time + max inter-arrival time
|
|
|
|
It can be seen that the next scheduling point of a sporadic message depends directly on the previous receive time. Therefore, a static list data structure has been created inside the sporadic check function to preserve the previous time stamps of the different sporadic messages. In the same manner, if previous data are required within another check function, a static list may be created in which each element contains the message ID and the required previous data.
|
|
|
|
The calendar re-arranges itself dynamically after each message arrival in such a way that the earliest schedule time occupies the head position of the list.
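The behavior of the event calendar can be sketched with a priority queue. Note that this sketch substitutes Python's <code>heapq</code> module for the re-sorted list described above; the class and function names are hypothetical and do not come from the EDU sources.

```python
import heapq


class EventCalendar:
    """Timeout deadlines in time order; the earliest deadline is at the head."""

    def __init__(self):
        self._heap = []      # entries: (deadline, msg_id)
        self._deadline = {}  # msg_id -> currently valid deadline

    def schedule(self, msg_id, deadline):
        """Schedule or re-schedule the timeout of a message."""
        self._deadline[msg_id] = deadline
        heapq.heappush(self._heap, (deadline, msg_id))

    def expired(self, now):
        """Return the IDs of messages whose deadline passed without arrival."""
        missing = []
        while self._heap and self._heap[0][0] <= now:
            deadline, msg_id = heapq.heappop(self._heap)
            # Entries superseded by a later schedule() call are skipped.
            if self._deadline.get(msg_id) == deadline:
                del self._deadline[msg_id]
                missing.append(msg_id)
        return missing


def next_deadline_periodic(current_time, period):
    return current_time + period


def next_deadline_sporadic(previous_arrival, max_interarrival):
    return previous_arrival + max_interarrival
```

Each arrival re-schedules the deadline of its message, so a single handler that calls <code>expired</code> periodically is enough to detect omissions of all messages.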
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Event_list_calendar.png|400px|center]]
|
|
|
|
|
|
|
|
===== Semantic fault detection mechanism =====
|
|
|
|
While ensuring deterministic behavior of both the middleware and the communication infrastructure helps considerably in classifying faults with respect to time, this is not the case when a sensor or actuator deviates from its normal operation. It is more complicated to detect an error in the message semantics. However, a wide variety of methods for detecting anomalies of a process have already been introduced. These methods may be classified as done by Isermann in <ref>Isermann, Rolf. Fault Diagnosis System. Heidelberg : Springer, 2006.</ref>.
|
|
|
|
*Signal-based fault detection: criteria such as limit checking or trend checking may be applied directly to the measured signal, or the measured signal may be analyzed to estimate a certain characteristic, which is then tested.
|
|
|
|
*Model-based fault detection: this method is more complicated. It takes the measured input and output signals of a process and applies them to the mathematical model of the process. Several features can then be estimated, e.g., parameters, state variables or residuals. Comparing these observed features with their nominal values generates analytical symptoms.
|
|
|
|
|
|
|
|
Three fault detection methods based on directly measured signals have been realized in the framework:
|
|
|
|
*'''Limit checking:''' each measured signal Y(t) is normally bounded by one or two thresholds Y_min and Y_max. If the signal exceeds one of its thresholds, an anomaly is detected. Normal fluctuations can occur and false alarms should be avoided; on the other hand, faults should be detected early. There is therefore a trade-off between thresholds that are too narrow and thresholds that are too wide. To use this check function in the framework, a dedicated C data structure has been defined that stores the message ID and the related maximum and minimum thresholds. To apply the limit check to a message, an instance of this structure has to be created and the limit checking process has to be inserted into the list of check processes.
|
|
|
|
*'''Trend checking:''' the principle of limit checking may also be applied to the first derivative Y'(t) of the measured value by setting minimum and maximum limits for the trend. Trend checking can detect a fault earlier. To compute Y'(t), the previous value Y(t)_old and its time stamp are required. The handling of the previous values of a message is done automatically; the user only has to insert the thresholds into a specific data structure.
|
|
|
|
*'''Application-specific plausibility check:''' when multiple measurements are available for the same process, a relationship between these measurements may be established as a basis for further checking. As an example of a plausibility check, consider a process with two measured variables X(t) and Y(t). Under normal conditions the following relationship should hold: if (Y_min < Y(t) < Y_max) then (X_min < X(t) < X_max). Since this type of check is application-specific, it was difficult and of little value to provide it as a general function in the framework; however, a use case example has been verified within the framework.
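The three signal-based checks can be sketched as follows. This is an illustrative Python sketch, not the C implementation of the EDU; the derivative is estimated as a simple difference quotient from the previous sample, which is an assumption about how Y'(t) is computed.

```python
def limit_check(y, y_min, y_max):
    """Limit checking: the signal must stay within its thresholds."""
    return y_min <= y <= y_max


class TrendCheck:
    """Trend checking: limits on the first derivative of the signal."""

    def __init__(self, d_min, d_max):
        self.d_min, self.d_max = d_min, d_max
        self._prev = None  # (value, time stamp) of the previous sample

    def __call__(self, y, t):
        prev, self._prev = self._prev, (y, t)
        if prev is None or t == prev[1]:
            return True  # no derivative can be estimated yet
        slope = (y - prev[0]) / (t - prev[1])
        return self.d_min <= slope <= self.d_max


def plausibility_check(x, y, x_range, y_range):
    """If Y lies within its range, X must lie within its range as well."""
    y_ok = y_range[0] < y < y_range[1]
    x_ok = x_range[0] < x < x_range[1]
    return (not y_ok) or x_ok
```

The trend check keeps the previous sample internally, which mirrors the statement that the handling of previous values is hidden from the user.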
|
|
|
|
|
|
|
|
The fault types covered by the EDU are general faults that are not related to a specific application. Developers of uAAL components can add many other application-specific fault detection methods to the EDU by using its data structure.
|
|
|
|
|
|
|
|
==Artefact #3 : Testing Module of systems fault-tolerant using Fault Injection Framework ==
|
|
|
|
|
|
|
|
=== Blackbox Description ===
|
|
|
|
The Fault Injection Framework is used on a distributed universAAL platform to test the reliability and safety of the system. The test case [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/1.png Cluster] consists of 5 nodes that implement an AAL application on the universAAL platform. The nodes are connected by two main communication channels, i.e., Ethernet and Time Triggered Ethernet, and are controlled by a server.
|
|
|
|
|
|
|
|
The framework includes the following:
|
|
|
|
* Development of an experimental framework with the following specifications:
|
|
|
|
** Linux-based execution environment that is able to work with real-time communication. The nodes must operate under a real-time Linux operating system. This type of operating system, with special kernel modifications, facilitates real-time communication between the node and its environment.
|
|
|
|
** Ethernet communication between server and nodes: the system must provide Ethernet communication between the nodes and the server. As the server needs to control the nodes within the network, the Ethernet communication allows the server to send instructions to the nodes and receive results from them.
|
|
|
|
* Execution environment on the nodes for experiments: the framework must be equipped with a special configuration that provides a suitable environment for experiments on the universAAL platform. This includes:
|
|
|
|
** TTEthernet configuration on both nodes and switch
|
|
|
|
** Ethernet Configuration on both server and nodes
|
|
|
|
* Experimental process on server and nodes: the system must allow the server to control the nodes using special tools. The server should be able to:
|
|
|
|
** Assign tasks and transfer them to the nodes
|
|
|
|
** Run the tasks on the nodes
|
|
|
|
** Receive logs for results
|
|
|
|
* Real-time experimental test application: the application must provide an example of real-time communication between nodes. It should be able to perform the following:
|
|
|
|
** Task Assignment from the server to the nodes.
|
|
|
|
** Real time communication between the nodes during the task execution.
|
|
|
|
** Collecting and sending of results from the nodes to the server.
|
|
|
|
* AAL application: The nodes must run AAL components under the universAAL platform.
|
|
|
|
* Fault Injection: The system must be tested under software fault injection to deal with safety and reliability issues. Several experiments must be implemented under different fault injection scenarios.
|
|
|
|
|
|
|
|
=== Bundles ===
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" colspan="2" | Artifact: '' Testing Module of systems fault-tolerant using Fault Injection Framework ''
|
|
|
|
|-
|
|
|
|
| GIT Address
|
|
|
|
| http://forge.universaal.org/svn/support/trunk/reliability/Fault%20Injection%20Framework
|
|
|
|
|-
|
|
|
|
| Javadoc
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
| Design Diagrams
|
|
|
|
| [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/8.png Sending Algorithm], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/9.png Receiving Algorithm], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/11.png Framework Launch script], [http://forge.universaal.org/mediawiki/images/thumb/9/9e/12.png/400px-12.png Server Script], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/13.png Node Side Script]
|
|
|
|
|-
|
|
|
|
| Reference Documentation
|
|
|
|
| http://forge.universaal.org/wiki/support:RD_Fault_Injection
|
|
|
|
|-
|
|
|
|
|}
|
|
|
|
|
|
|
|
=== Requirements ===
|
|
|
|
In the following, the related requirements and their status are listed:
|
|
|
|
* ''' RC9_TR2''' ''Design for Testability''
|
|
|
|
* ''' RC9_TR3''' ''Correctness-by-Construction''
|
|
|
|
* ''' RC9_TR7''' ''Fault Hypothesis''
|
|
|
|
* ''' RC9_TR12''' ''Tolerance of Software Errors''
|
|
|
|
* ''' RC9_TR14''' ''Fault Classification''
|
|
|
|
* ''' RC9_TR16''' ''Worst Case Execution''
|
|
|
|
* ''' RC9_TR18''' ''Diagnostic Service''
|
|
|
|
|
|
|
|
=== Features ===
|
|
|
|
|
|
|
|
Enhances the testability of the system by using a fault injection framework to ensure that the platform is reliable and tolerant of faults that may occur at run time.
|
|
|
|
|
|
|
|
=== Design Decisions ===
|
|
|
|
|
|
|
|
==== Model Concept ====
|
|
|
|
This section discusses the model concept together with its main structure and functionality. The model concept is illustrated in the following figure:
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/2.png|400px|center]]
|
|
|
|
|
|
|
|
|
|
|
|
* Controller: The role of the controller is to instruct and direct the fault injection tool. The controller initiates the fault injection process and receives the results of each experiment conducted.
|
|
|
|
* Fault injection tool: This part is responsible for generating the faults and injecting them into the AAL components. It collects the results and sends them back to the controller.
|
|
|
|
* Communication Environment: This part is responsible for the communication between the AAL components. It can be an Ethernet connection or a real-time communication system such as Time Triggered Ethernet.
|
|
|
|
* AAL components: These are the components of the system. A node can run one or more components on the universAAL platform.
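The interplay of controller, fault injection tool and component can be sketched as follows. The fault types, the function names and the shape of the result log are hypothetical; the actual framework controls real nodes over Ethernet, which this single-process sketch does not attempt to model.

```python
import random


def inject_fault(value, fault, rng):
    """Fault injection tool: return a corrupted copy of an integer value.

    Returns None for an omission (the message is never delivered).
    """
    if fault == "omission":
        return None
    if fault == "offset":
        return value + 1000
    if fault == "bit_flip":
        return value ^ (1 << rng.randrange(8))
    return value  # "none": the value is passed through unchanged


def run_experiment(component, inputs, fault, seed=0):
    """Controller: run one experiment and collect the result log."""
    rng = random.Random(seed)
    log = []
    for value in inputs:
        injected = inject_fault(value, fault, rng)
        log.append({
            "input": value,
            "injected": injected,
            "output": component(injected),
        })
    return log
```

A component under test can be any callable, for example a function that reports whether it detected an anomaly in the value it received; the controller then compares the logs of runs with and without injected faults.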
|
|
|
|
|
|
|
|
==== Model specifications ====
|
|
|
|
This section presents the model used in our framework. The framework builds on the universAAL platform as its AAL platform. It is therefore designed to serve AAL applications within the universAAL platform, which differentiates it from other existing frameworks. It has its own specification that facilitates integration with the universAAL platform.
|
|
|
|
# Distributed system: AAL applications can be implemented on a distributed system in which each part runs a different AAL application and communicates with the other parts. The framework therefore consists of several nodes that run AAL applications and communicate with each other.
|
|
|
|
# Ethernet Communication: The universAAL platform components have their own communication environment, which uses the Ethernet protocol. The nodes within the model are able to communicate with each other over Ethernet.
|
|
|
|
# Real-Time Communication: The development of the universAAL communication system and applications introduces the use of real-time communication within the communication environment. The model supports this by providing the ability to use real-time communication between its components.
|
|
|
|
# Fault Injection: The model must be able to apply fault injection within the universAAL platform in support of fault tolerance applications.
|
|
|
|
|
|
|
|
==== The Model Node Construction ====
|
|
|
|
Each node in the model hosts a single component. The AAL components are nodes that are configured to run the universAAL platform. These nodes must contain the following tools in order to provide the specified services.
|
|
|
|
* Real Time Operating System: The model was constructed to support real-time communication. For this purpose the nodes must run a real-time operating system, which provides the operating system support required by the real-time communication system.
|
|
|
|
* Real Time Configuration Tool: This tool switches the communication between the nodes to real-time communication mode whenever this is needed; it loads the real-time configuration file into the operating system kernel and configures the network interfaces.
|
|
|
|
* Ethernet Controlling Tool: This tool facilitates communication between the nodes and the Controller (server), and allows the server to control the operations and instructions running inside the nodes.
|
|
|
|
* Real-Time Communication Tool: This tool carries the real-time communication between the nodes and provides the real-time specifications.
|
|
|
|
* AAL Platform: The universAAL platform is used as the standard AAL platform on which the nodes run their AAL applications.
|
|
|
|
|
|
|
|
==== Sequence of Actions of the universAAL-Based Fault Injection Framework ====
|
|
|
|
This section illustrates the concept and the sequence of phases of the proposed and implemented universAAL-based fault injection framework. Faults are injected into the system by software. The framework runs within the universAAL platform: it injects faults, collects the results and sends them to the server. The fault injection phases are shown in the following figure:
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/f3.png|600px|center]]
|
|
|
|
|
|
|
|
* Start up: The server (Controller) sends a startup command to all nodes synchronously in order to initiate the fault injection run inside the nodes.
|
|
|
|
* Update the specified task and clean the last results files: Each node has a list of tasks assigned by the server and a set of log files, created during task execution, that contain the results. After receiving the startup command from the server, the nodes prepare for the run: they fetch the new task list from the server and clear all log files so that they are ready for the new results.
|
|
|
|
* Initiate the universAAL platform and join the bus: The model runs within the universAAL platform and uses its bus model for communication. At this stage each node starts the universAAL platform required to run the AAL application. This includes starting the universAAL layers and bundles, joining the platform bus and discovering the other nodes connected to the bus.
|
|
|
|
* Run the application and save the results on the local nodes: In this phase the nodes start the updated task. During the task the nodes communicate with each other while being disturbed by the fault injection process, which is implemented programmatically inside the task. The nodes run several experiments and produce several results, which are saved on the local node while the phase is running.
|
|
|
|
* Exit the bus and shut down the universAAL platform: After finishing the run, each node leaves the bus and closes the connection. Finally the universAAL platform layers and bundles are shut down.
|
|
|
|
* Send the results to the server: Each node sends the results of the run to the server as files.
|
|
|
|
* Collect and arrange all results in one folder: The server collects the results received from the nodes, then renames, arranges and processes them according to node number.
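The phase order listed above can be written down as a small plan generator. This is a minimal sketch, not the framework's code: the names <code>Phase</code>, <code>SERVER_PHASES</code> and <code>run_plan</code> are hypothetical, and the split between server and node phases is read off the descriptions above.

```python
from enum import Enum


class Phase(Enum):
    """Phases of one fault injection run, in the order listed above."""
    START_UP = 1
    UPDATE_TASK_AND_CLEAN_LOGS = 2
    START_PLATFORM_AND_JOIN_BUS = 3
    RUN_APPLICATION = 4
    LEAVE_BUS_AND_SHUT_DOWN = 5
    SEND_RESULTS = 6
    COLLECT_RESULTS = 7


# Assumption: the first and last phase are carried out by the server,
# all other phases by every node.
SERVER_PHASES = {Phase.START_UP, Phase.COLLECT_RESULTS}


def run_plan(node_names):
    """Return the ordered list of (actor, phase) steps for one run."""
    plan = []
    for phase in Phase:  # Enum iteration follows definition order
        if phase in SERVER_PHASES:
            plan.append(("server", phase))
        else:
            plan.extend((name, phase) for name in node_names)
    return plan
```

Calling <code>run_plan(["node1", "node2"])</code> yields the server start-up step first, then each node phase for both nodes, and the server collection step last.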
|
|
|
|
|
|
|
|
==== Usage of the Fault Injection Framework for Diagnosis ====
|
|
|
|
This fault injection framework provides the starting point for the current development of the diagnosis framework for universAAL. From the component behavior and fault statistics provided by the fault injection framework, knowledge of the symptoms that lead to a fault can be formulated; based on the behavior of these symptoms, the diagnosis system can classify the fault. The relationship between the outcome of the fault injection framework and the diagnosis framework is depicted in the following figure.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/14.png|400px|center]]
|
|
|
|
|
|
|
|
The results of fault injection provide the observations from which the symptoms for diagnosis are built. The intermediate events between a fault and its symptoms are not always visible from the symptom behavior. The symptom-event-fault chain therefore provides the path to diagnosis: the current diagnosis framework handles the symptoms so that the fault can be classified correctly and the diagnostic measure for that specific fault can be realized in universAAL. This will also facilitate the use of the situation reasoner to create, publish and consume context events related to different faults. Future work includes identifying the different fault scenarios, implementing different error detectors and implementing the diagnostics framework for universAAL.
|
|
|
|
|
|
|
|
=== Implementation ===
|
|
|
|
Building the model presented above requires a certain procedure. This chapter presents the framework’s implementation procedure, including the required hardware and software construction. To test and use the framework, a universAAL-based cluster was built.
|
|
|
|
|
|
|
|
==== Initial implementation from selected input projects ====
|
|
|
|
There was no initial implementation from the input projects.
|
|
|
|
|
|
|
|
==== Implementation Plan ====
|
|
|
|
'''Setup an Emulation Environment on a Distributed Cluster:'''
|
|
|
|
|
|
|
|
This sub-section describes the distributed system that is used, covering both the hardware and the software and their implementation.
|
|
|
|
|
|
|
|
===== Hardware =====
|
|
|
|
Please refer to the Reference Documentation for the Hardware set up of the test case.
|
|
|
|
|
|
|
|
===== Software Construction =====
|
|
|
|
In our system, the nodes must be configured with specific software utilities so that the system can carry out the specified tasks (i.e. run the universAAL platform and run AAL applications).
|
|
|
|
The real-time operating system is one of the most important parts. For our system, real-time Linux (RT-Linux) was chosen. Other efficient real-time operating systems such as RTAI could be used, but RT-Linux was preferred for three reasons: the drivers and configuration instructions of the Time-Triggered switch in use are compatible with RT-Linux; RT-Linux is much easier to install and configure; and RT-Linux is more widely used and receives frequent updates and maintenance.
|
|
|
|
|
|
|
|
The Secure Shell (SSH) server utility provides direct and secure access to the nodes, which makes it easy to log in to, control and modify them from the server. The Network File System (NFS) utility is used to mount directories between the nodes and the server; the mounted folders are shared and can be used to transfer files between them. Real-time communication requires special configuration files to be loaded into the Linux kernel on the nodes, and a special configuration on the TTEthernet switch.
|
|
|
|
Finally, the nodes run AAL components under the universAAL platform, which must be installed on the nodes.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/5.png|300px|center]]
|
|
|
|
|
|
|
|
==== Fault Injection Mechanism ====
|
|
|
|
The software fault injection application is used to exercise and improve fault tolerance mechanisms. The application is designed to run on the universAAL platform, which it uses for its operation and communication. It is designed so that the user can carry out a fault injection process; to this end, the program specification must make it easy for the developer to inject faults into the application in order to test the behavior of the nodes.
|
|
|
|
|
|
|
|
===== Track the Application Behavior =====
|
|
|
|
The application execution is initiated from the server, which sends the instructions synchronously to the nodes; the nodes then start running the application. At the end, the results are collected and sent as log files to a specified results folder on the server.
|
|
|
|
|
|
|
|
The timeline of the application, illustrated in section 3.2.4, comprises the following main steps:
|
|
|
|
# The developer starts the application with a specified command on the server. The server connects to the nodes using the Secure Shell (SSH) server utility and runs certain scripts inside the nodes.
|
|
|
|
# The nodes update the task assigned by the server using the Network File System (NFS) server utility.
|
|
|
|
# The universAAL platform is started; the nodes join the bus and start recognizing the other nodes on the bus.
|
|
|
|
# Once the bus service has stabilized, the Send-Receive application starts up and each node saves the results of each iteration locally in a log file.
|
|
|
|
# When the execution is finished, the node exits the bus and shuts down the platform.
|
|
|
|
# The nodes send the results to the server using the NFS server utility.
|
|
|
|
# The server will collect the results in one folder.
|
|
|
|
|
|
|
|
===== Application Specifications =====
|
|
|
|
The application has a specification that allows different testing scenarios to be used for the fault injection process and that facilitates the execution of the application.
|
|
|
|
|
|
|
|
''Message Contents''
|
|
|
|
|
|
|
|
In the universAAL platform, data transfer between nodes is based on events. An event contains information about a device, such as the status of the device or changes in its properties. For this reason a virtual device is created to act as the source of the event, and the event carries information about this device. For simplicity, the device was chosen to be a gauge, and the property whose status is reported by the event is the change of battery level.
|
|
|
|
Each device should have a unique property that differentiates it from the other devices. Receiving nodes can use this property to express interest in specific types of events sent on the bus: if an event falls within the receiver's interest, the node receives it.
|
|
|
|
|
|
|
|
Multiple devices must be used in order to create more than one type of event. These events are:
|
|
|
|
* Startup event: notifies the receiving nodes that the process of sending events, or a new loop of sending events, is starting.
|
|
|
|
* Actual event: the event that is counted and evaluated by the fault injection application. It is sent according to the time and rate specification.
|
|
|
|
* End event: notifies the receivers that the event sending loop has finished. The receivers then initiate a new receiving session.
|
|
|
|
* Exit event: this event is used to end the testing operation.
|
|
|
|
Each device has a unique source label, which is used on the receiving side to identify the source of the event.
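The event kinds and the source label can be modelled with plain data types. The sketch below is an illustration only and does not use the universAAL API: <code>EventKind</code>, <code>GaugeEvent</code> and <code>make_interest</code> are hypothetical names, and the battery level is assumed to be an integer percentage.

```python
from dataclasses import dataclass
from enum import Enum


class EventKind(Enum):
    STARTUP = "startup"
    ACTUAL = "actual"
    END = "end"
    EXIT = "exit"


@dataclass(frozen=True)
class GaugeEvent:
    source_label: str   # unique label of the virtual device
    kind: EventKind
    battery_level: int  # reported property (assumed to be a percentage)


def make_interest(source_labels):
    """Build a predicate that accepts only events from the given labels."""
    wanted = frozenset(source_labels)

    def is_interesting(event):
        return event.source_label in wanted

    return is_interesting
```

A receiver that only cares about one sender would build its filter with <code>make_interest({"node1/gauge"})</code> and apply it to every event seen on the bus.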
|
|
|
|
|
|
|
|
''Send-Receive methodology''
|
|
|
|
|
|
|
|
In the universAAL platform, events are sent to the bus using a Context Publisher. The application must instantiate this object and connect it to a specific provider, which represents the source of the publisher. Once the publisher is initialized, it is connected to the bus and ready to send events. After finishing the task, the context publisher must be closed and disconnected from the bus.
|
|
|
|
Events are received using a Context Subscriber object. The application must instantiate the context subscriber and define its restrictions, which express the interest of the subscriber: if an event sent to the bus falls within the subscriber's interest, it is delivered to the node; otherwise nothing is received.
|
|
|
|
To make the application more efficient, sending and receiving are performed on different threads, which allows the developer to control each thread separately.
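The separation into a sending thread and a receiving thread can be illustrated without the platform. In this sketch a <code>queue.Queue</code> stands in for the context bus and a predicate stands in for the subscriber's restrictions; <code>run_exchange</code> and its parameters are hypothetical and are not part of universAAL.

```python
import queue
import threading


def run_exchange(payloads, accepts):
    """Send payloads on one thread and receive matching ones on another."""
    bus = queue.Queue()   # stand-in for the context bus
    received = []
    stop = object()       # sentinel: the publisher has been closed

    def sender():
        for payload in payloads:
            bus.put(payload)
        bus.put(stop)

    def receiver():
        while True:
            item = bus.get()
            if item is stop:
                break
            if accepts(item):  # the subscriber's restriction
                received.append(item)

    threads = [threading.Thread(target=receiver),
               threading.Thread(target=sender)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received
```

Because each side runs on its own thread, either one can be delayed, stopped or duplicated independently, which is what the fault injection scenarios rely on.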
|
|
|
|
|
|
|
|
''Timing''
|
|
|
|
|
|
|
|
The application has several timing aspects related to the sending and receiving procedure:
|
|
|
|
* Event Sending Rate:
|
|
|
|
The rate at which a node sends events has a major effect on the reliability and efficiency of the send-receive process, and varying it is one of the fault injection scenarios that can be used. The rate must therefore be defined as a variable that the developer can control. In our application, which has two sending nodes, the sending rate of each node can be controlled separately and adjusted according to the test specification.
|
|
|
|
* Minimum Event Sending Rate:
|
|
|
|
Sending an event to the bus depends not only on the execution time of the send instruction but also on the delay of the bus: once the send command is issued, the total time consists of the actual execution time plus the delay caused by the bus.
|
|
|
|
To choose an effective and suitable event sending rate, several experiments were run on the nodes. The number of nodes connected to the bus was changed with every trial, starting with 1 node and ending with 5 nodes, with node 1 acting as sender and the other nodes as receivers. The average event sending rate was calculated; the results are shown in the following graph.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/7.png|500px|center]]
|
|
|
|
|
|
|
|
According to these results the minimum event sending rate must be 4.5 event/ms.
|
|
|
|
* Inter burst time :
|
|
|
|
Between consecutive loops of sending events there must be a pause. This safety period gives the bus time to flush, ensuring that all events waiting in the bus are sent and received. Each event waits for some time in the bus queue; if the sending publisher is terminated before these events have been received, some of them are lost. This time is very important in the fault injection procedure and can be varied to test the behavior of the nodes.
|
|
|
|
* Iteration Time :
|
|
|
|
The event sending loop is controlled by a predefined running time. In the application this time is a variable that can be adjusted according to the test specification.
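The timing parameters above interact in a simple way, which the following sketch makes explicit. It is an assumption-laden illustration: <code>Timing</code> and its field names are hypothetical, all values are in seconds, and the per-event cost (execution time plus bus delay) is passed in as a measured constant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Timing:
    delay_s: float        # pause between two consecutive events
    iteration_s: float    # running time of one sending loop
    inter_burst_s: float  # pause after each loop so the bus can drain

    def events_per_iteration(self, send_cost_s=0.0):
        """Estimate how many events fit into one sending loop.

        send_cost_s is the time one send call takes, i.e. the execution
        time plus the delay caused by the bus.
        """
        per_event = self.delay_s + send_cost_s
        if per_event <= 0:
            raise ValueError("delay plus send cost must be positive")
        return int(self.iteration_s / per_event)

    def experiment_duration_s(self, iterations):
        """Total time for the given number of loops, including pauses."""
        return iterations * (self.iteration_s + self.inter_burst_s)
```

Shortening <code>delay_s</code> raises the sending rate, while shortening <code>inter_burst_s</code> increases the chance that queued events are lost, so both are natural knobs for a fault injection test.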
|
|
|
|
|
|
|
|
''Results Output''
|
|
|
|
|
|
|
|
To make it easier to collect and process the results of the send-receive operation, the results are saved in a separate file that contains the results of each loop. Each node has its own results file, whose contents depend on the node's role (sending or receiving):
|
|
|
|
* Sending Results: the results file must include the following information:
|
|
|
|
** Iteration Number: specifies the sending loop number, and it is increased by each new loop. The start and the end of the iteration are specified by the sender node.
|
|
|
|
** Start Sending time (only for the event sending nodes): this time is used to stamp the loop starting time in the sender node.
|
|
|
|
** End Sending Time (only for the event sending nodes): this time is used to stamp the loop ending time.
|
|
|
|
** Number of Events: the number of events that were sent by the node.
|
|
|
|
* Receiving Results: the results file must include the following information:
|
|
|
|
** Iteration Number: specifies the receiving iteration number, and it is increased by each new iteration. The start and the end of the iteration are specified by the sender node.
|
|
|
|
** Start Receiving time: this time is used to stamp the receiving loop starting time in the receiving node.
|
|
|
|
** End Receiving time: This time is used to stamp the receiving loop end time in the receiving node.
|
|
|
|
** Number of Events: the number of events that were received by the node.
|
|
|
|
To make the results easier to read and process, the information is saved in a log file with tab-delimited fields.
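A tab-delimited results log of this shape can be written and read back with Python's standard <code>csv</code> module. The column names in <code>FIELDS</code> are assumptions chosen to mirror the fields listed above; the actual file layout of the framework is not specified here.

```python
import csv
import io

# Hypothetical column names mirroring the fields listed above.
FIELDS = ["iteration", "start_time", "end_time", "events"]


def write_results(stream, rows):
    """Write one header line and one tab-delimited line per loop."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


def read_results(stream):
    """Read the log back as a list of dictionaries (values are strings)."""
    return list(csv.DictReader(stream, delimiter="\t"))
```

The same two functions work for sender and receiver logs, since both record an iteration number, two time stamps and an event count.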
|
|
|
|
|
|
|
|
''Using the runners for execution''
|
|
|
|
|
|
|
|
The application must start the universAAL platform and run without a graphical user interface. This allows the developer to run the application remotely (for example over SSH) and allows the server to control and run the application using automated scripts.
|
|
|
|
The Pax Runner utility can be used to create a “Felix” framework in which the universAAL platform bundles can be started.
|
|
|
|
|
|
|
|
''Scripting''
|
|
|
|
|
|
|
|
For efficiency, the whole fault injection application must be driven by scripts, which automate the whole procedure and make complex fault injection tests possible:
|
|
|
|
|
|
|
|
* Server script: the application must be started simultaneously on all nodes so that all nodes have the same running time. It is also more efficient to run the application with one command than to log in to each node and start the application from its local drive.
|
|
|
|
The server script must contain all the commands required to start the application on every node at the same time. It must not wait for the results, otherwise the execution would not be simultaneous. In addition, the output must not interrupt the procedure, so the application runs in the background on the nodes.
|
|
|
|
* Node script: this script resides on the nodes and aims to:
|
|
|
|
** Get the latest update of the application from the server using the NFS server.
|
|
|
|
** Start the Pax Runner, which starts the universAAL platform and runs the application.
|
|
|
|
** Send the results file to the server and clean the file for new test.
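The two requirements on the server script (start every node with one command, and do not wait for output) can be sketched as follows. Host names and the script path are placeholders, and <code>build_start_command</code> and <code>start_all</code> are hypothetical helpers; only the standard <code>ssh</code> client and the shell's <code>nohup</code> and <code>&amp;</code> are assumed to be available.

```python
import shlex
import subprocess


def build_start_command(host, node_script):
    """Return the ssh argv that starts node_script on host in the background."""
    # nohup plus output redirection lets ssh return immediately, so the
    # server does not wait for the node to finish.
    remote = f"nohup {shlex.quote(node_script)} > /dev/null 2>&1 &"
    return ["ssh", host, remote]


def start_all(hosts, node_script, launch=subprocess.Popen):
    """Launch the node script on every host without waiting for any of them."""
    return [launch(build_start_command(host, node_script)) for host in hosts]
```

The <code>launch</code> parameter exists so that the command construction can be checked without opening real connections; with the default it spawns one <code>ssh</code> process per node in quick succession.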
|
|
|
|
|
|
|
|
''Fault Injection Scenarios''
|
|
|
|
|
|
|
|
There are several scenarios for fault injection that can be implemented:
|
|
|
|
* Changing the pause time, which affects the time events wait in the event queue inside the bus.
|
|
|
|
* Disturbing the bus by creating faulty behaviour in one node, which causes the node to connect to and disconnect from the bus repeatedly and without purpose.
|
|
|
|
* Change the number of sending nodes and their event sending rate.
|
|
|
|
* Stop a node during the receiving process.
|
|
|
|
* Initiating multiple context subscribers or context publishers.
|
|
|
|
|
|
|
|
===== Fault Injection Process Model =====
|
|
|
|
This section presents the construction steps of the fault injection application. The first step is to define the main programs used for sending and receiving events. They are separated into two programs: the Send program, which sends the events, and the Receive program, which receives them. A Pax Runner script is then used to start the Felix framework that launches the universAAL platform. In addition, scripts must be written for both the server and the nodes so that the process is automated and controlled by the server.
|
|
|
|
|
|
|
|
'''Programs and scripts description'''
|
|
|
|
* Send Program :
|
|
|
|
This program sends events from a specific node to the other nodes. It is designed so that the user can implement different fault injection scenarios by changing the program parameters. The structure of the Send program is described in the following [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/8.png diagram].
|
|
|
|
|
|
|
|
The Sending program phases are:
|
|
|
|
|
|
|
|
# Initiation process: the program initializes the essential objects, including the context publisher, which is responsible for sending the events to the bus, and the context provider, which represents the source of the context publisher.
|
|
|
|
# Creating the virtual devices: the virtual devices whose status the events represent are created.
|
|
|
|
# Generating the events: the devices are now ready to report their status; the events tied to the device status are defined and ready to be sent.
|
|
|
|
# Checking the loop number: the user provides the number of experiments to be performed in each execution, which is also the number of iterations that the program performs. If all loops have been completed, the program proceeds to the exit process; otherwise it continues with the next step.
|
|
|
|
# Send start event: after entering the loop and before sending status messages, the program sends the start event to inform the other nodes that a new loop of sending events is starting, so that they can prepare for it.
|
|
|
|
# Save time stamp: the start time of the new loop is saved in a predefined variable, to be written to the results file in the save results step and used in the results processing.
|
|
|
|
# Delay: this parameter sets the delay between two consecutive events. By manipulating it, the user can change the event sending rate, which can be used in the fault injection process.
|
|
|
|
# Send status event: at this stage the status events will be sent.
|
|
|
|
# Check the loop time: each loop of sending status events is controlled by the loop period, which is specified by the user. If this period has elapsed, the program exits the loop and saves the results; otherwise it goes back to the send status event phase.
|
|
|
|
# Wait after checking: when the send status event phase has finished, the program gives the events waiting in the bus queue time to be sent, clearing the bus of queued events. This time parameter is a possible fault injection factor.
|
|
|
|
# Save time stamp: the time stamp of the end of the loop is saved in a variable for use in the save results phase.
|
|
|
|
# Send end loop event: the sending node sends the end loop event to inform the other nodes that its sending loop has finished.
|
|
|
|
# Save results: the results of the sending process are saved. They include the loop number, the time stamp of the start of each loop, the time stamp of the end of each loop and the number of events that were sent.
|
|
|
|
# Wait after save: this wait time leaves a gap between loops of sending events, giving the other nodes time to save the results of the last loop and get ready for the new one.
|
|
|
|
# Send end event: when the whole sending operation and the experiment have finished, the node sends the end process event to inform the other nodes that the experiment is finished.
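The loop structure of the Send program phases can be sketched in a few lines. This is not the framework's source: <code>run_sender</code> and its parameters are hypothetical, events are reduced to plain strings, and <code>send</code> stands in for publishing through the context publisher. The clock and sleep functions are injectable so that the loop can be exercised without real delays.

```python
import time


def run_sender(send, iterations, loop_period, delay, flush_wait, save_wait,
               clock=time.monotonic, sleep=time.sleep):
    """Run the sending loops and return one result row per loop."""
    results = []
    for loop_number in range(1, iterations + 1):   # check the loop number
        send("start")                              # send start event
        started = clock()                          # save time stamp
        sent = 0
        while True:
            sleep(delay)                           # delay between events
            send("status")                         # send status event
            sent += 1
            if clock() - started >= loop_period:   # check the loop time
                break
        sleep(flush_wait)                          # wait after checking
        ended = clock()                            # save time stamp
        send("end_loop")                           # send end loop event
        results.append((loop_number, started, ended, sent))  # save results
        sleep(save_wait)                           # wait after save
    send("end_process")                            # send end event
    return results
```

Changing <code>delay</code>, <code>flush_wait</code> or <code>save_wait</code> between runs corresponds to the timing-based fault injection factors mentioned in the phase list.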
|
|
|
|
|
|
|
|
* Receiving Program
|
|
|
|
The [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/9.png Receive Program] is used by the nodes to receive the events sent by the other nodes. It consists of two parts: the main part, which initializes all objects needed for the receiving task, and the csubscriber, which is the part that actually receives the events.
|
|
|
|
|
|
|
|
The Receive program works as follows:
|
|
|
|
|
|
|
|
# Start phase: the program is started.
|
|
|
|
# Wait: in the starting phase, the receive function waits for a specific period of time, which gives the framework enough time to register on the bus and to recognize any other nodes connected to the bus.
|
|
|
|
# Initiation process: the program initializes the objects required for receiving, including the context event pattern, which specifies the restrictions on the event types received by the context subscriber.
|
|
|
|
# Create Csubscriber: this step creates the context subscriber object, which is responsible for receiving the events.
|
|
|
|
|
|
|
|
The context subscriber object uses another thread to receive the events. It has several functions, each responsible for a specific task. The most critical part of our implementation is the [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/10.png “handle context event”] function, which is modified to meet the framework requirements.
|
|
|
|
|
|
|
|
# Receiving events: this function is executed whenever an event within the Csubscriber's interest is sent to the bus. The first step is to receive the event and identify its properties.
|
|
|
|
# Assign the event source: the node from which the event originates is identified so that the event can be counted in its proper counter category.
|
|
|
|
# Check the event category: the type of the received event is checked (status, start, end iteration or end process), and the next step is chosen accordingly.
|
|
|
|
# Increase status counter: if the received event is a status event, the counter defined for the specific sending node is increased and the function returns to the “receiving events” phase.
|
|
|
|
# Save the time stamp (start event): if the received event is a start event, the time stamp of the beginning of reception from the specific node is stored so that it can be saved in the “saving results” phase.
|
|
|
|
# Increase loop counter (start event): the receiving loop counter is increased once a start event is received.
|
|
|
|
# Reset the event status counter: the status event counter is reset in order to start a new receiving loop.
|
|
|
|
# Check end event: if the received event is marked as an “end event”, it is checked whether it is an end process or an end loop event, and the next step is chosen accordingly.
|
|
|
|
# Save the time stamp (end loop event): if the received event is an “end loop event”, the time stamp of the end of the receiving loop is recorded so that it can be saved with the results in the saving results phase.
|
|
|
|
# Save results (end loop event): if the received event is an end loop event, the results of the receiving loop are saved in a log file and the function returns to the receiving events phase.
|
|
|
|
# Check if all sending nodes have finished: if the received event is marked as an “end process event”, the function checks whether all sending nodes have finished. If not, it returns to the receiving events phase; otherwise it continues with the next step.
|
|
|
|
# Wait (finished): once all sending nodes have finished their task, the function waits for a predefined period so that the bus can be cleared of events.
|
|
|
|
# Exit all: the function ends all processes running within the program and shuts down the framework.
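The branching inside the event handler described above amounts to a small state machine with one counter set per sending node. The sketch below is an illustration under stated assumptions, not the modified “handle context event” function itself: <code>Receiver</code> is a hypothetical class, events are reduced to a source label and a kind string, and the final wait and shutdown are left to the caller once <code>done</code> becomes true.

```python
class Receiver:
    """Counts status events per sender and emits one row per finished loop."""

    def __init__(self, senders, clock, save_row):
        self.pending = set(senders)   # senders that have not finished yet
        self.clock = clock
        self.save_row = save_row
        self.status_count = {s: 0 for s in senders}
        self.loop_count = {s: 0 for s in senders}
        self.loop_start = {}
        self.done = False

    def handle_event(self, source, kind):
        if kind == "status":
            self.status_count[source] += 1
        elif kind == "start":
            self.loop_start[source] = self.clock()   # save start time stamp
            self.loop_count[source] += 1             # increase loop counter
            self.status_count[source] = 0            # reset status counter
        elif kind == "end_loop":
            self.save_row((source, self.loop_count[source],
                           self.loop_start[source], self.clock(),
                           self.status_count[source]))
        elif kind == "end_process":
            self.pending.discard(source)
            self.done = not self.pending             # all senders finished?
```

Keeping the counters keyed by source is what allows a single receiver to follow several sending nodes at once and to tell which sender lost events.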
|
|
|
|
|
|
|
|
* [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/11.png Framework Launch script]
|
|
|
|
To run the OSGi framework and the Felix framework without running the Eclipse SDK, a modified Pax Runner that can be executed through a terminal command can be used. An executable script calls a Felix file to launch the Pax Runner; this starts the frameworks required for the universAAL platform, starts all bundles and applications related to the platform, and starts the Fault Injection Framework.
|
|
|
|
|
|
|
|
# Initiate OSGi framework: the script starts by defining the OSGi start level at which all required bundles are activated and ready. The execution environment, including all required applications, is also defined.
|
|
|
|
# Initiate Felix framework: the start level of the Felix framework and the Felix settings are defined.
|
|
|
|
# Initiation process: the required applications for running the Felix framework will be initiated and activated.
|
|
|
|
# Join the bus: this function activates the UPnP driver and joins the system bus.
|
|
|
|
# Activate the Middleware : all the middleware bundles that are required for the universAAL platform will be installed and activated.
|
|
|
|
# Activate the ontologies: the universAAL platform ontology bundles are installed and activated.
|
|
|
|
# Activate the application: the Fault Injection Framework will be installed and activated.
|
|
|
|
# Start the framework: at this point all requirements of the Felix framework are activated and ready, and the framework starts.
|
|
|
|
|
|
|
|
* [http://forge.universaal.org/mediawiki/images/thumb/9/9e/12.png/400px-12.png Server Script]:
|
|
|
|
The server controls the Fault Injection Framework: it starts the script and triggers the application tasks synchronously on the nodes. At the end of the script it collects the results.
|
|
|
|
|
|
|
|
# Send the task: the server script starts by sending the assigned task to the nodes.
|
|
|
|
# Run the task in the node: the server-side script starts the node script on the nodes themselves.
|
|
|
|
# Collect the results: after the task has finished, the server collects the results sent from the nodes and organizes them in one results folder.
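The collection step can be illustrated with the standard library alone. The directory layout is an assumption made for this sketch: each node is taken to deliver a file named <code>results.log</code> into its own sub-folder of a shared incoming directory, and <code>collect_results</code> is a hypothetical helper that renames the files by node while copying them into one folder.

```python
import shutil
from pathlib import Path


def collect_results(incoming, collected):
    """Copy <incoming>/<node>/results.log to <collected>/<node>_results.log."""
    collected = Path(collected)
    collected.mkdir(parents=True, exist_ok=True)
    copied = []
    node_dirs = sorted(p for p in Path(incoming).iterdir() if p.is_dir())
    for node_dir in node_dirs:
        source = node_dir / "results.log"
        if source.is_file():   # skip nodes that delivered nothing
            target = collected / f"{node_dir.name}_results.log"
            shutil.copyfile(source, target)
            copied.append(target.name)
    return copied
```

Nodes without a results file are skipped rather than treated as an error, which is useful when a fault injection scenario deliberately stops a node.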
|
|
|
|
|
|
|
|
* [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/13.png Node Side Script]
|
|
|
|
On the node side, a script is required to execute some essential functions for starting the framework.
|
|
|
|
|
|
|
|
# Update the task: the script starts by updating the task, which is sent by the server, from the node’s specific folder.
|
|
|
|
# Execute the task: the script executes the task assigned to it by triggering the application.
|
|
|
|
# Send results: during task execution the results are saved on the local node; after the task has finished, the script sends them to the server.
|
|
|
|
# Clean results files: once the results have been sent to the server, the node deletes the old results to be ready for the next task.
|
|
|
|
|
|
|
|
==Artefact #4 : Time Triggered Ethernet Extension Module ==
|
|
|
|
|
|
|
|
=== Blackbox Description ===
|
|
|
|
|
|
|
|
A uAAL environment contains a large variety of embedded devices that need to communicate with each other to deliver a specific AAL service. The delivered AAL services differ in their degree of impact on the end user. Safety-critical services that deal with the user’s health and with emergency systems in a uAAL space must be delivered correctly to meet their high reliability targets; consequently, the devices that cooperate to deliver such services should support real-time communication and fault tolerance mechanisms. Real-time communication means that it is not enough for a network participant to receive a response that is correct in the value domain; the response must also be correct in the time domain.
|
|
|
|
|
|
|
|
Given the above, dealing with AAL environments means dealing with different scenarios that have different technical requirements regarding real-time communication. Consequently, different communication networks, protocols and services could be needed to cover all of these requirements. TTEthernet closes this gap by combining networks of different criticality into one network. Time-triggered services inspired by the Time-Triggered Protocol <ref> The time-triggered architecture. Kopetz, H., and Bauer, G. s.l. : IEEE Special Issue on Modeling and Design of Embedded., 2003.</ref> are combined with the standard IEEE 802.3 Ethernet protocol in one Ethernet network. Time-triggered services provide a temporal firewall, i.e. no message is transmitted or received at the wrong time, which facilitates the use of fault tolerance techniques to add more reliability to the services.
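The idea of correctness in the time domain can be made concrete with a toy check. This is a deliberately simplified sketch and not how TTEthernet is configured or implemented: <code>Slot</code> and <code>in_slot</code> are hypothetical names, the schedule is assumed to be strictly periodic, and all times are integers in microseconds on a common clock.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    offset_us: int   # start of the window within the period
    length_us: int   # width of the window


def in_slot(time_us, period_us, slot):
    """True if time_us falls inside the slot's window of its period."""
    phase = time_us % period_us
    return slot.offset_us <= phase < slot.offset_us + slot.length_us
```

A message whose value is correct but which arrives outside its window, for example at phase 260 µs for a slot covering 200 to 250 µs, would be rejected by such a check; this is the sense in which a response must be correct in both the value and the time domain.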
|
|
|
|
|
|
|
|
=== Bundles ===
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" colspan="2" | Artifact: '' Time Triggered Ethernet Extension Module ''
|
|
|
|
|-
|
|
|
|
| GIT Address
|
|
|
|
| http://forge.universaal.org/svn/support/trunk/reliability/TTE%20Extension
|
|
|
|
|-
|
|
|
|
| Javadoc
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
| Design Diagrams
|
|
|
|
| [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt7.png TTEMessageAction()], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt8.png processBusMessage()], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt9.png TT-Messages transmutation native method], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt10.png TTEListener package], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt11.png nativeTTEMsgListening], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt13.png nativeTTEMsgFetching]
|
|
|
|
|-
|
|
|
|
| Reference Documentation
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
|}
|
|
|
|
|
|
|
|
=== Requirements ===
|
|
|
|
* '''RC9_TR3''' ''Correctness-by-Construction''
|
|
|
|
* '''RC9_TR4''' ''Delay/Disruption-tolerant''
|
|
|
|
* '''RC9_TR5''' ''Communication Resource''
|
|
|
|
* '''RC9_TR6''' ''Unreliable Components''
|
|
|
|
* '''RC9_TR10''' ''Consistent membership Service''
|
|
|
|
* '''RC9_TR11''' ''Generic Fault-tolerance Layer''
|
|
|
|
* '''RC9_TR13''' ''Bounded Start-up and Restart Time''
|
|
|
|
* '''RC9_TR15''' ''Pre-emptive Resource Allocation''
|
|
|
|
|
|
|
|
=== Features ===
|
|
|
|
This uAAL based extension module provides Time-triggered temporal guarantees and facilitates the use of fault tolerance specifications to support the reliability of the uAAL system.
|
|
|
|
|
|
|
|
=== Design Decision ===
|
|
|
|
|
|
|
|
One of the main goals of the universAAL project is to make AAL services more reliable and safe by adding real-time communication capabilities to the nodes. A real-time communication infrastructure such as TTEthernet gives AAL service developers a reliable platform on which to build reliable AAL services. The implementation of the Time Triggered Connector is separated into two main phases:
|
|
|
|
|
|
|
|
# Implementation of the Time Triggered Ethernet patch
|
|
|
|
# Creation of an independent Time Triggered Ethernet connector
|
|
|
|
|
|
|
|
This version of the deliverable presents the first phase; the second phase will be included in future versions.
|
|
|
|
|
|
|
|
==== What is Time Triggered Ethernet (TTEthernet) ====
|
|
|
|
===== Overview =====
|
|
|
|
Over the past decades, Ethernet has become the most successful local area network technology in the world. Because Ethernet communication is event-triggered and open in nature, it is difficult to guarantee strict temporal properties on it. Nevertheless, many projects have tried to adapt Ethernet to time-critical applications (e.g. ARINC, ProfiNet). Driven by the demand for a unified communication architecture that covers both real-time and non-real-time applications, TTEthernet first appeared as an academic project at Vienna University. TTEthernet can be considered a unification of the best properties of standard Ethernet and TTP/C.
|
|
|
|
|
|
|
|
TTEthernet provides seamless communication for a wide range of networks by using Ethernet. Applications with different degrees of safety criticality can be combined in one network that is fully compatible with the IEEE 802.3 Ethernet standard. Because TTEthernet uses time-triggered services, it adds features such as temporal partitioning, precise diagnosis, efficient resource utilization, and composability to the system. Since several applications with different requirements can be deployed in an AAL environment, unifying them under a single network is a significant advantage. Moreover, features like temporal partitioning support the reliability of such systems.
|
|
|
|
|
|
|
|
===== Basic concept of TTEthernet protocol =====
|
|
|
|
As shown in the next figure, a TTEthernet network consists of end systems, which contain the host application, and switches, which organize the different traffic classes available in TTEthernet. Synchronization between nodes is required to exchange messages within time-triggered traffic. For synchronization, different functions are assigned to the parts of a TTEthernet network (switches and end systems). The Synchronization Master (SM) function is assigned only to nodes (e.g. SM1, SM2 and SM3); the other nodes follow their clocks. The main function of switches in a TTEthernet network is Compression Master: the switch collects the local clocks of the SMs, combines them, and retransmits the combined clock to the SMs and SCs. Both end systems and switches can be Synchronization Clients (SC).
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt1.png| Typical TTEthernet network |500px|center]]
|
|
|
|
|
|
|
|
The TTEthernet communication protocol provides three communication modes (traffic classes). The frames of all these modes are exchanged over one integrated physical network. The main traffic class in the TTEthernet protocol is Time-Triggered (TT) traffic, which is best suited to time-critical applications. Messages in this class are transported at fixed points in time according to a static schedule. If an end system decides not to use its dedicated time slot, the switch senses the inactivity and frees the bandwidth for other traffic.
|
|
|
|
The second traffic class used by the TTEthernet protocol is Rate-Constrained (RC) traffic. It is used when the application has less strict real-time and determinism requirements. To transport messages in this class, sufficient bandwidth must be allocated so that delays and temporal deviations are bounded. Because RC message exchange does not depend on the synchronization between nodes, several nodes may send their RC messages simultaneously. To bound the resulting delays, the transmission rate of RC messages must be known, so that the upper bound of the transmission latency can be calculated off-line. If neither TT nor RC messages occupy the bandwidth, Best-Effort (BE) messages can be transmitted.
|
|
|
|
|
|
|
|
===== Selection of TTEthernet artifacts =====
|
|
|
|
To set up a new TTEthernet network, two types of artifacts are needed: switches and end systems. Several FPGA-based switches developed by TTTech Computertechnik AG are available; they differ in the communication speed they support. A 100 Mbit/s FPGA-based switch has been selected for the development model. Regarding end systems, TTTech provides two types:
|
|
|
|
# FPGA-based TTEthernet end systems, which are characterized by high speed and strong capabilities regarding real-time communication and fault tolerance.
|
|
|
|
# Software-based TTEthernet end systems, which use a software stack called the TTE Protocol Layer. The software-based end system is based on COTS hardware and is suitable for a broad range of Ethernet applications, such as real-time control, data acquisition, or multimedia applications. The software stack can also run under an operating system.
|
|
|
|
To facilitate the development work, and to set up both the uAAL platform and the network configuration under a single operating system, the software-based TTEthernet end system has been selected.
|
|
|
|
|
|
|
|
==== TTE Protocol Layer ====
|
|
|
|
Any hardware platform that has a timer interrupt mechanism and an Ethernet controller can host the TTE Protocol Layer, which in turn can run on different operating systems. The protocol layer uses the dedicated hardware platform and the operating system to execute the TTEthernet protocol, which involves:
|
|
|
|
*Transmission and reception of synchronization frames.
|
|
|
|
*Transmission of Time-Triggered and Best-Effort Messages.
|
|
|
|
*Time-Triggered reception for both TT and BE messages.
|
|
|
|
*Time-Triggered execution of application tasks.
|
|
|
|
|
|
|
|
===== Construction of TTE Protocol Layer =====
|
|
|
|
The TTE Protocol Layer consists of three basic elements, as shown in the next figure:
|
|
|
|
*The TTEthernet core, which coordinates between the hardware drivers (Network Interface Card (NIC) and timer) on one side and the application-specific configuration files on the other side to handle the execution of the TTE protocol.
|
|
|
|
*The NIC driver, which provides low-level access to a network card. Depending on the application-specific configuration requirements, the NIC driver adapts the network card so that the communication nodes can communicate with each other using the TTEthernet protocol.
|
|
|
|
*The timer driver, which provides a free-running timer with a programmable interrupt to the TTEthernet core. It takes its orders from the time actions predefined in each end-system configuration file. The configuration files contain time actions related to communication, such as clock synchronization actions and actions for sending and receiving TT and BE messages.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt2.png| Simple TTEthernet End-System construction|300px|center]]
|
|
|
|
|
|
|
|
The TTEthernet core, together with the hardware drivers, an application-specific configuration and optionally a set of application tasks, is compiled into a single kernel module (.ko file). The resulting module provides an interface to user-space applications through the network devices, and another interface, the TTEthernet API, for tasks that run in the kernel.
|
|
|
|
|
|
|
|
=== Implementation ===
|
|
|
|
==== Initial implementation from selected input projects ====
|
|
|
|
There was no initial implementation from the input projects; the need for this extension arose from the lack of reliability and fault-tolerance support in those projects.
|
|
|
|
|
|
|
|
==== Implementation Model ====
|
|
|
|
As depicted in the following figure, the model used in this work consists of 5 communication nodes, a TTEthernet switch, an Ethernet switch and a server PC. The server is used to generate the configuration files for the network and to transfer these files to the corresponding communication nodes over the Ethernet network through the Ethernet switch. Each communication node has been configured with RT-Linux as a real-time operating system. On top of the operating system, two basic elements have been installed:
|
|
|
|
# UniversAAL platform.
|
|
|
|
# TTE Protocol Layer.
|
|
|
|
Since the schedule of time-triggered events must be fixed during operation, each node has a static configuration. In our case, each node has been configured to send one TT message and receive 4 TT messages, one from each other node, during one cluster cycle, in addition to the synchronization and BE messages. The next figure shows the timeline of node 1, which sends one and receives 4 TT messages.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt3.png| Time schedule for node 1 through one cluster cycle |500px|center]]
|
|
|
|
|
|
|
|
===== Compilation of the used Configuration =====
|
|
|
|
The elements of the TTE Protocol Layer are then compiled with the static configuration file of each node to form a single kernel module. As shown in the next figure, a typical node has two NICs (eth0, eth1). The eth0 channel has been chosen to exchange messages over the TTEthernet network. Before the new kernel module is inserted into the Linux kernel, eth0 must be disabled. When the new kernel module is inserted, the physical network interface (eth0) is replaced by six logical network interfaces: 5 of them are dedicated to exchanging TT messages, while the last one is used for exchanging messages on BE traffic.
|
|
|
|
|
|
|
|
Note that each configured TT message must have a unique name, called the Virtual Link Identification (VLID). For example, the TT message of node 1 is called (101), that of node 2 (102), and so on. As the previous figure shows, the names of the logical network interfaces are derived from the corresponding VLID.
|
|
|
|
|
|
|
|
==== Putting into practice ====
|
|
|
|
P2PConnector is an essential part of the universAAL platform. It provides seamless connectivity between different middleware instances in such a way that each peer can dynamically discover the services offered by other remote peers. P2PConnector is already implemented under the ACL layer, which is responsible for creating peering functionality among the distributed nodes. Two discovery protocols are already implemented under ACL, i.e. two types of P2PConnector are available in the universAAL platform:
|
|
|
|
*UPnP P2PConnector
|
|
|
|
*R-OSGi P2PConnector
|
|
|
|
Both of these technologies have discovery capabilities, i.e. they already provide the discovery protocol. Therefore, in order to create a new P2PConnector that uses the TTEthernet service, it is necessary to provide the TTEthernet service with a discovery protocol (e.g. SSDP as in UPnP).
|
|
|
|
Because of these difficulties, it was decided to implement the TTEthernet patch under the UPnP P2PConnector. In other words, all discovery functions are left to the UPnP connector, while the exchanged messages are sent using the TTEthernet patch. The figure shows the available P2PConnectors of the middleware and how the TTE patch is connected to the UPnP connector. For more information of how UPnP-connector is created.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt5.png| Inserting TTE Patch within uAAL middleware |400px|center]]
|
|
|
|
|
|
|
|
UPnP-Connector and the other P2PConnectors in middleware can access the SodaPop layer and create a new local instance through two interface declarations; PeerDiscoveryListener and SodaPopPeer.
|
|
|
|
The PeerDiscoveryListener interface has two methods:
|
|
|
|
*noticeNewPeer
|
|
|
|
*noticeLostPeer
|
|
|
|
UPnP-Connector invokes these methods to notify all locally registered PeerDiscoveryListeners about the appearance of a new peer or the loss of an existing peer.
|
|
|
|
SodaPopPeer interface has the following methods:
|
|
|
|
*joinBus
|
|
|
|
*leaveBus
|
|
|
|
*noticePeerBuses
|
|
|
|
*replyPeerBuses
|
|
|
|
*processBusMessage
|
|
|
|
The first four methods are used by the local UPnP connector to create the local SodaPopPeer instance. This local instance is also used by the last method (processBusMessage) to transfer a message from the remote SodaPop peer to the local peer. As a first step in developing the TTE patch, only the processBusMessage method has been selected to be invoked through the TTE patch, while the first 4 methods have been left to UPnP-Connector.
|
|
|
|
To understand how processBusMessage is invoked over the TTE patch, it is important to understand how the other methods are invoked over UPnP-Connectors. Suppose two uAAL nodes have each created a UPnP-Connector. Each connector discovers its partner and creates a proxy object for the remote instance. If node 1 wants to invoke a method on the remote peer, the steps shown in the next figure are followed:
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt6.png| method invoke using UPnP-Connectors |500px|center]]
|
|
|
|
|
|
|
|
#The SodaPop instance of node 1 calls the required method on Proxy 2.
|
|
|
|
#Proxy 2 forwards the message in serialized form to the UPnP-connector of node 1.
|
|
|
|
#The message is transferred then from UPnP-Connector of node 1 to its partner in node 2.
|
|
|
|
#The UPnP connector of node 2 in turn de-serializes the message, derives the intended method and its parameters, and then calls that method with those parameters on the SodaPop instance that is registered locally with it.
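The last step above (de-serialize, derive the method, call it on the local instance) can be illustrated with Java reflection. This is only a sketch under stated assumptions: the wire format <code>methodName;arg1;arg2</code>, the classes <code>LocalPeerStub</code> and <code>RemoteCallSketch</code>, and the two-string method signatures are hypothetical and are not taken from the UPnP connector.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the locally registered SodaPop instance.
class LocalPeerStub {
    final List<String> calls = new ArrayList<>();

    public void joinBus(String busName, String peerId) {
        calls.add("joinBus:" + busName + ":" + peerId);
    }

    public void processBusMessage(String busName, String msg) {
        calls.add("processBusMessage:" + busName + ":" + msg);
    }
}

class RemoteCallSketch {
    // Assumed wire format for illustration only: "methodName;arg1;arg2".
    static String serialize(String method, String arg1, String arg2) {
        return method + ";" + arg1 + ";" + arg2;
    }

    // Receiving side: de-serialize, derive the method by name, call it locally.
    static void dispatch(Object localPeer, String wire) {
        String[] parts = wire.split(";", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("malformed call: " + wire);
        }
        try {
            Method m = localPeer.getClass()
                    .getMethod(parts[0], String.class, String.class);
            m.setAccessible(true);
            m.invoke(localPeer, parts[1], parts[2]);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot invoke " + parts[0], e);
        }
    }
}
```

The sending proxy would call <code>serialize</code> and hand the string to its connector; the receiving connector would pass the string to <code>dispatch</code> together with its local instance.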
|
|
|
|
|
|
|
|
For processBusMessage(), step 2 is changed: the message is forwarded to the TTE patch instead of the UPnP-Connector. To achieve this, several points have to be taken into consideration:
|
|
|
|
#Since TT messages are broadcast on the network, the destination ID must be included in the message.
|
|
|
|
#The TTE network does not recognise the SodaPopPeer ID, i.e. only the Virtual Link Identification (VLID) is recognised.
|
|
|
|
#An up-to-date image of the SodaPop instance must be kept so that a newly received message can access the SodaPop layer.
|
|
|
|
#An algorithm is needed to send messages on the TTE network, making the required changes to the message arguments.
|
|
|
|
#An algorithm is needed to receive TT messages and perform all the processing required to submit them in the same way as the UPnP connector does.
|
|
|
|
|
|
|
|
===== Coupled ID Protocol =====
|
|
|
|
To manage the first two aspects, a new class named TTEMsgHandling has been created in the upnp.importer package to execute a new protocol (called the Coupled ID protocol here), which is described as follows:
|
|
|
|
When the UPnP bundle starts, the Activator class registers the new connector with the OSGi service registry and opens a listener for the other UPnP-Connectors. A new job has been added to the Activator class before the registration step: it sends a new TT message from the host node to all other nodes in the network, if any exist. The new TT message consists of two strings separated by the special sign “|”: the first part is the VL-ID of the transmitting node, and the last part is the middleware instance ID of the same node. This message is saved as a static string in the TTEMsgHandling class and is then sent by calling TTEMsgHandling.sendIdMsg(String msgMod).
|
|
|
|
In the sendIdMsg(String msgMod) method, another string part is added to the message, so that it has the following form:
|
|
|
|
|
|
|
|
msgMod|VL-ID|middleware_instance_ID
|
|
|
|
|
|
|
|
This type of message can be sent in two different modes, “nul” or “one”, depending on whether the message is transmitted as a request or as a response:
|
|
|
|
*If sendIdMsg() is invoked by the UPnP Activator class, the new node sends the message with mode “nul”, i.e. it sends a request to all active nodes that says: this is my coupled ID, please send me back your coupled ID.
|
|
|
|
*If at least one node receives this message, it recognizes the message from its mode as a request for its coupled ID. The receiving node splits off the mode part and saves the remote coupled ID in a string matrix called remotePeersId, which is defined in the TTEMsgHandling class. The mode is then tested. In this case the carried mode is “nul”, so the message is interpreted as a request, and the host node sends back its coupled ID with mode “one” as a response to be saved by the remote partner.
|
|
|
|
Consequently, both nodes are introduced to each other and can then exchange the messages carried on the buses of their middleware instances.
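The request/response exchange of the Coupled ID protocol can be sketched as follows. This is a minimal illustration, not the TTEMsgHandling code: the class <code>CoupledIdSketch</code> and its methods are hypothetical, the transport is left out, and a map replaces the string matrix remotePeersId for brevity. Only the message layout <code>msgMod|VL-ID|middleware_instance_ID</code> and the modes “nul” and “one” come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

class CoupledIdSketch {
    static final String REQUEST = "nul";   // "this is my coupled ID, send me yours"
    static final String RESPONSE = "one";  // answer to a request

    final String localVlId;
    final String localInstanceId;
    // VL-ID of each known remote peer mapped to its middleware instance ID.
    final Map<String, String> remotePeersId = new HashMap<>();

    CoupledIdSketch(String localVlId, String localInstanceId) {
        this.localVlId = localVlId;
        this.localInstanceId = localInstanceId;
    }

    // Builds "msgMod|VL-ID|middleware_instance_ID" for the local node.
    String buildIdMsg(String msgMod) {
        return msgMod + "|" + localVlId + "|" + localInstanceId;
    }

    // Stores the sender's coupled ID, then returns the reply to send,
    // or null when the received message was itself a response.
    String onIdMsg(String received) {
        String[] parts = received.split("\\|", 3);
        if (parts.length != 3) {
            return null; // not a coupled ID message
        }
        remotePeersId.put(parts[1], parts[2]);
        return REQUEST.equals(parts[0]) ? buildIdMsg(RESPONSE) : null;
    }
}
```

With two nodes <code>a</code> and <code>b</code>, passing <code>a.buildIdMsg("nul")</code> to <code>b.onIdMsg</code> yields b's response, and passing that response to <code>a.onIdMsg</code> completes the exchange without a further reply.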
|
|
|
|
|
|
|
|
===== Saving an updated Image of middleware local instance =====
|
|
|
|
To receive a message in the same way as the processBusMessageAction() class does, an equivalent class named TTEMessageAction() has been created. This class can receive an identical message with the same input arguments, but the question is how it can access the SodaPop layer: a local SodaPopPeer instance is needed.
|
|
|
|
Based on the assumption that each OSGi framework hosts only one SodaPop instance, one instance could be saved somewhere and reused by this class whenever a new message is received. However, the SodaPopPeer instance is not a static object, i.e. it may change dynamically depending on the whole AAL space. For example, a communication node may be added or removed at any moment, and within one node one or more uAAL-aware components may join or leave a certain bus. For this reason, the image of the local instance has to be re-saved dynamically.
|
|
|
|
|
|
|
|
Three classes from the UPnP package are used to keep one up-to-date image of the local instance:
|
|
|
|
*NoticePeerBusesAction()
|
|
|
|
*JoinBusAction()
|
|
|
|
*leaveBusAction()
|
|
|
|
A new instance of [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt7.png TTEMessageAction()], with the local middleware instance as an input argument, is created whenever one of the above classes is invoked by the UPnP connector. The class diagram below describes this operation.
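The idea of keeping one up-to-date image of the local instance can be sketched with a small holder. This is an illustration under assumptions, not the approach in the source, which creates a new TTEMessageAction() each time: the generic class <code>LocalInstanceImage</code> is hypothetical and swaps in a shared <code>AtomicReference</code> so that the receiving thread always sees the most recently stored instance.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical holder for the latest local middleware instance.
class LocalInstanceImage<T> {
    private final AtomicReference<T> latest = new AtomicReference<>();

    // Would be called from the join/leave/notice actions with the current instance.
    void refresh(T localInstance) {
        latest.set(localInstance);
    }

    // Would be called by the receiving side; returns null until the first refresh.
    T current() {
        return latest.get();
    }
}
```

A receiving class would call <code>current()</code> for every incoming TT message and skip delivery while the result is still null.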
|
|
|
|
|
|
|
|
===== TTE Transmitting Algorithm =====
|
|
|
|
When a middleware instance has a message on one of its buses and wants to send this message to the same bus of a remote instance, it calls the [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt8.png processBusMessage()] function of the sodaPopProxy class of the related remote instance. The processBusMessage() method is the crossroads: the message is first transmitted through the TTEthernet network interface by calling TTEMsgHandling.main(String [] sentMsg). If sending succeeds, this call returns 1; otherwise it returns 0. The returned value is then tested: if it is 1, the message is not transmitted again by UPnPConnector; otherwise sending through the TTEthernet service has failed and the message is transmitted by UPnPConnector.
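The fallback decision described above can be sketched as follows. The interface <code>Transport</code> and the class <code>FallbackSender</code> are hypothetical names used only for illustration; the only detail taken from the text is that the TTE path returns 1 on success and 0 on failure, and that UPnP is used only after a failure.

```java
class FallbackSender {
    // Hypothetical abstraction over a sending path; returns 1 on success, 0 on failure.
    interface Transport {
        int send(String busName, String msg);
    }

    // Tries the TTE path first and falls back to UPnP only when TTE reports failure.
    // Returns the name of the path that was finally used.
    static String send(Transport tte, Transport upnp, String busName, String msg) {
        if (tte.send(busName, msg) == 1) {
            return "TTE";
        }
        upnp.send(busName, msg);
        return "UPnP";
    }
}
```

For example, <code>FallbackSender.send((b, m) -> 0, upnpPath, "context", "hello")</code> would hand the message to <code>upnpPath</code> exactly once.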
|
|
|
|
|
|
|
|
In addition to the original input arguments (busName, msg), two other arguments are added by the main(String [] sentMsg) function: the address (VLID) of the destination node and the mode of the message. All input arguments are concatenated into one string, as shown below:
|
|
|
|
msgMod|TTEId|busName|msg
|
|
|
|
Since the message modes “nul” and “one” have been reserved for exchange messages within Coupled ID-protocol, the mode of this type of messages is “two”.
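A minimal sketch of building and splitting a mode “two” message is given below. The class <code>BusMessageCodec</code> is hypothetical. It assumes that the VLID and the bus name never contain “|”; the split limit of 4 keeps any “|” inside the payload intact.

```java
class BusMessageCodec {
    static final String MODE_BUS = "two";

    // Builds "msgMod|TTEId|busName|msg" for a bus message.
    static String encode(String tteId, String busName, String msg) {
        return MODE_BUS + "|" + tteId + "|" + busName + "|" + msg;
    }

    // Splits into {mode, TTEId, busName, msg}; the payload may itself contain '|'.
    static String[] decode(String wire) {
        String[] parts = wire.split("\\|", 4);
        if (parts.length != 4) {
            throw new IllegalArgumentException("not a bus message: " + wire);
        }
        return parts;
    }
}
```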
|
|
|
|
Since the universAAL platform is written in Java and the TTE protocol in C, the Java Native Interface (JNI) is used to exchange data between the native method, which is written in C, and the Java code. The processed message is therefore delivered through JNI to a native method that is responsible for broadcasting the message on the TTE network. The first diagram in the next figure describes all the processing applied to the message within the main() function until the message is delivered to the native method.
|
|
|
|
The native method algorithm, shown in the next figure (second diagram), receives an input argument of type jstring from the Java class. C cannot use this type directly, so it has to be converted to a C string. An Ethernet header is added to the converted message to mark it as a TT message rather than a BE message. In total the message size is 1514 bytes, which is the maximum message size that can be transmitted on the TTE Protocol Layer and is identical to the message size set in the configuration. When the message from the Java class is larger than 1500 bytes, it is transmitted in two or more chunks. After the message has been prepared, a RAW socket is opened to send it on the TTE network. The implemented [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt9.png TT-Messages transmission native method] is a Java method responsible for transmitting TT-messages to the native JNI class.
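The splitting of a long message into chunks of at most 1500 bytes can be sketched in Java as follows. The real splitting happens in the native C method; the class <code>TtFragmenter</code> is hypothetical, the end mark “||” is the completion sign described for the listening side, and padding of the last chunk to the fixed frame size is left out. The use of UTF-8 is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TtFragmenter {
    static final int MAX_PAYLOAD = 1500;   // payload bytes per frame
    static final String END_MARK = "||";   // marks the end of a whole message

    // Appends the end mark and cuts the bytes into chunks of at most MAX_PAYLOAD.
    static List<byte[]> fragment(String message) {
        byte[] data = (message + END_MARK).getBytes(StandardCharsets.UTF_8);
        List<byte[]> chunks = new ArrayList<>();
        for (int from = 0; from < data.length; from += MAX_PAYLOAD) {
            int to = Math.min(from + MAX_PAYLOAD, data.length);
            chunks.add(Arrays.copyOfRange(data, from, to));
        }
        return chunks;
    }
}
```

Note that the end mark itself can be cut across two chunks, so a receiver has to test for it on the concatenated data rather than on a single chunk.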
|
|
|
|
|
|
|
|
===== TTEListener package =====
|
|
|
|
A separate Maven Java project named TTEListener has been created to contain all the classes that perform the listening job on the TTEthernet channels. This project has been built as a universAAL application within Eclipse. Creating a universAAL application creates a new OSGi bundle, so an Activator class is generated automatically. In addition to the Activator class, the package includes the TTMsgListening and TTEMsgFetching Java classes.
|
|
|
|
Two native methods are invoked from these classes. The first, invoked from TTEMsgListening, starts the listening process on the TTEthernet channels and saves each received message in a FIFO queue. The second, invoked by TTEMsgFetching, fetches the saved messages from the queue. Each fetched message is returned to the TTEMsgFetching class for further processing. The flow diagram describes how these classes interact in the [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt10.png TTEListener package].
|
|
|
|
|
|
|
|
The main job of the Activator class in this design is to trigger the processes in both TTEMsgListening and TTEMsgFetching without waiting for the return value of either. Initiating the TTEMsgListening class creates a separate thread that invokes nativeTTEMsgListening, which in turn opens a socket for each TT channel and listens on that channel. Additionally, TTEMsgListening creates a new instance of TTEMsgFetching, which also creates a separate thread to do the fetching job. This thread enters an infinite loop: in each iteration it invokes its native method (nativeTTEMsgFetching()) and waits to fetch a new message. After a message has been obtained from the native method, it is submitted to another class for further processing, and the loop invokes the native method again in the next iteration.
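The two-thread structure described above can be sketched in plain Java. This is an illustration only: the class <code>ListenerPipelineSketch</code> is hypothetical, a <code>LinkedBlockingQueue</code> stands in for the native FIFO queue, a list of strings stands in for the TT channels, and the fetching loop is bounded (instead of infinite) so that the example terminates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

class ListenerPipelineSketch {
    // Runs one listening thread and one fetching thread over a shared queue
    // and returns the messages in the order the fetcher handed them on.
    static List<String> run(List<String> incoming) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> processed = new ArrayList<>();
        int expected = incoming.size();

        // Listening side: stores every received message in the queue.
        Thread listener = new Thread(() -> {
            for (String msg : incoming) {
                queue.add(msg);
            }
        });
        // Fetching side: blocks until a message is available, then processes it.
        Thread fetcher = new Thread(() -> {
            try {
                for (int i = 0; i < expected; i++) {
                    processed.add(queue.take());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        listener.start();
        try {
            listener.join();
            fetcher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }
}
```

Neither thread waits for the other to finish its whole job; they are coupled only through the queue, which mirrors the role of the Activator as a pure trigger.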
|
|
|
|
Both native methods (nativeTTEMsgFetching and nativeTTEMsgListening) are contained in one dynamic library named libJniListener.so.
|
|
|
|
|
|
|
|
''nativeTTEMsgListening''
|
|
|
|
|
|
|
|
This native method is invoked by the TTMsgListening class. It needs neither input nor output arguments, since its only purpose is to create a RAW socket for each TT channel and listen on that socket. In our cluster model use case, each node opens four listening channels, one for each remote node. Thus, [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt11.png nativeTTEMsgListening] starts four threads and returns nothing. All of these threads perform the same tasks. The flow chart in the next figure shows the main tasks of a listening thread.
|
|
|
|
|
|
|
|
The listening process begins by opening a RAW socket to listen on. The same data structure that is used on the sending side is also used on the listening side. When a message is received, several filtering steps are applied to it. The first step checks the size of the received message: if it does not equal the size of the sent message, an error message is printed and the algorithm goes back to receive another message. If the size matches, the message is treated as a correct message. A second check then tests whether the message is complete. A special sign “||” is put at the end of each transmitted message. If this sign is not found at the end of the received message, this part of the message is appended to a string buffer and the algorithm waits for the next part; otherwise the message is complete.
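The completion check with the “||” end sign can be sketched as follows. The real check runs in the native C listening thread; the class <code>TtReassembler</code> is hypothetical, and the frame size check is omitted.

```java
class TtReassembler {
    static final String END_MARK = "||";
    private final StringBuilder pending = new StringBuilder();

    // Appends one received part. Returns the complete message without the
    // end mark, or null while further parts are still expected.
    String accept(String part) {
        pending.append(part);
        int len = pending.length();
        boolean complete = len >= END_MARK.length()
                && pending.substring(len - END_MARK.length()).equals(END_MARK);
        if (!complete) {
            return null;
        }
        String message = pending.substring(0, len - END_MARK.length());
        pending.setLength(0); // start fresh for the next message
        return message;
    }
}
```

A limitation of this scheme is that a part whose content happens to end with “||” would be taken as the end of the message.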
|
|
|
|
When the received message is complete, its mode is detected. As mentioned before, three modes are used to send messages on a TT channel: modes “nul” and “one” are used for messages that carry ID information, while mode “two” is used for exchanging bus messages. The first two types of message are transmitted without a destination address, so there is no need to check which node they are intended for. Since all nodes receive the TT message transmitted by one node (as set in the cluster model configuration), a message with mode “two” has to be classified according to the destination address it carries: if the destination address matches the address of the TT listening channel, processing of the message continues; otherwise the message is ignored.
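The mode-based classification can be sketched as follows. The class <code>ModeFilter</code> and the enum <code>Route</code> are hypothetical. The sketch assumes that the address to compare against is the VLID of the local node, which is one reading of "the address of the TT listening channel".

```java
class ModeFilter {
    enum Route { ID_HANDLING, BUS_DELIVERY, IGNORE }

    // Decides what to do with a complete message of the form "mode|id|...".
    static Route classify(String message, String localVlId) {
        String[] parts = message.split("\\|", 3);
        if (parts.length < 2) {
            return Route.IGNORE; // malformed message
        }
        switch (parts[0]) {
            case "nul":
            case "one":
                // Coupled ID messages carry no destination and go to every node.
                return Route.ID_HANDLING;
            case "two":
                // Bus messages are kept only by the addressed node.
                return parts[1].equals(localVlId) ? Route.BUS_DELIVERY : Route.IGNORE;
            default:
                return Route.IGNORE;
        }
    }
}
```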
|
|
|
|
Finally, the message takes its final form and is saved in a circular FIFO queue, as shown in the next figure. The queue is defined as a global variable so that it can be accessed by the fetching native method.
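A circular FIFO queue of the kind described here can be sketched in Java as follows. The actual queue lives in the native C library; the class <code>RingQueue</code> is hypothetical, and rejecting new messages when the queue is full is an assumed policy.

```java
class RingQueue {
    private final String[] slots;
    private int head;   // index of the oldest stored message
    private int tail;   // index where the next message is written
    private int count;  // number of stored messages

    RingQueue(int capacity) {
        slots = new String[capacity];
    }

    // Stores a message; returns false when the queue is full.
    synchronized boolean offer(String msg) {
        if (count == slots.length) {
            return false;
        }
        slots[tail] = msg;
        tail = (tail + 1) % slots.length; // wrap around at the end of the array
        count++;
        return true;
    }

    // Removes and returns the oldest message, or null when the queue is empty.
    synchronized String poll() {
        if (count == 0) {
            return null;
        }
        String msg = slots[head];
        slots[head] = null;
        head = (head + 1) % slots.length;
        count--;
        return msg;
    }
}
```

The methods are synchronized because the listening threads write to the queue while the fetching thread reads from it.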
|
|
|
|
|
|
|
|
''nativeTTEMsgFetching''
|
|
|
|
|
|
|
|
This native method is invoked in a separate thread from the TTEMsgListener Java class. Its main purpose is to fetch the messages already received through the TT channels. The fetching algorithm ([http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt13.png nativeTTEMsgFetching]) is shown in this flow diagram.
|
|
|
|
|
|
|
|
The process begins with an infinite loop. At the top of this loop, the queue is checked to see whether it is empty. If the queue is empty, a new pull is attempted after waiting for a certain time, and the loop repeats. The algorithm exits the loop when a pull succeeds.
|
|
|
|
To give a holistic view of the whole project, a class diagram has been created of all classes developed during this thesis and their relationships to other classes in the UPnP connector package. The class diagram shows the relationships among these classes in logical sequence to describe the three main functions of this development work:
|
|
|
|
*Exchanging coupled IDs.
|
|
|
|
*Transmitting TT-messages.
|
|
|
|
*Receiving TT-messages.
|
|
|
|
The Activator class of UPnP-Connector begins the process by invoking the sendIdMsg() method of the TTEMsgHandling class, which forwards the message to the TTEthernet network by invoking the native method. On the other side, the TTEMsgFetching class invokes its native method to receive three types of message. Two of them carry ID information and are forwarded to TTEMsgHandling, which saves the message and replies to the ID request in the case of “nul” mode. The third type, which represents a uAAL message, is forwarded to the TTEMsgAction class, where an up-to-date image of the local instance resides; using this local instance, the message can then be forwarded to SodaPop.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Tt14.png| Class diagram describing the uAAL-TTEthernet Interface implementation |600px|center]]
|
|
|
|
|
|
|
|
==Artefact #5 : Fault Tolerance (Replication) Module ==
|
|
|
|
|
|
|
|
=== Blackbox Description ===
|
|
|
|
|
|
|
|
Replication, or redundancy of components, is the creation of replicas of system components with the aim of increasing the reliability of the system. This can be achieved using N-modular redundancy as the form of fault tolerance. In Triple Modular Redundancy (TMR), also known as Triple Mode Redundancy, a faulty component is out-voted by the two remaining components. Furthermore, the reliability of a system can be enhanced so that it runs with the minimum allowed failure rate by arranging and implementing redundancy in the system’s instances and replicating the system hardware. In this way, a failure of a single replica does not affect the reliability of the overall system. Another method of redundancy is Double Module Redundancy (DMR): the node is duplicated, and the second working node covers the failure of the first node. To decide on our redundancy management approach, we need to conduct a root cause analysis to find the reason for each failure occurrence and the use cases that we will cover. This analysis has been carried out as an extension of the one done for the diagnosis framework and the fault hypothesis (see Diagnosis Framework).
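The out-voting of a faulty replica in TMR can be illustrated with a simple majority voter. This is a generic sketch of the technique, not the code of the redundancy bundle; the class <code>TmrVoter</code> is hypothetical.

```java
class TmrVoter {
    // Returns the value produced by at least two of the three replicas,
    // or null when all three disagree and no majority exists.
    static <T> T vote(T a, T b, T c) {
        if (a != null && (a.equals(b) || a.equals(c))) {
            return a;
        }
        if (b != null && b.equals(c)) {
            return b;
        }
        return null;
    }
}
```

For example, with replica outputs 5, 5 and 7 the voter returns 5, so the single deviating replica does not affect the result.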
|
|
|
|
|
|
|
|
Redundancy takes two forms, both of which keep performance degradation within acceptable limits:
|
|
|
|
*Active redundancy: Ensures performance by monitoring each component individually and using this monitoring to implement voting logic. The voting mechanism switches between components and reconfigures them accordingly. Examples of voting in redundancy management logic are the following:
|
|
|
|
**Error detection and correction.
|
|
|
|
**Data radio selection in aircraft.
|
|
|
|
**Global Positioning System (GPS).
|
|
|
|
|
|
|
|
*Passive redundancy: Uses excess capacity to reduce the impact of component failures. It is commonly associated with a decline in performance: after a failure, the system provides simpler features while maintaining its basic functionality.
|
|
|
|
|
|
|
|
=== Bundles ===
|
|
|
|
{| border="1" style="cellspacing=0; bordercolor=gray; align=left; valign=top;"
|
|
|
|
! align="left" bgcolor="#DDDDDD" colspan="2" | Artifact: '' Fault Tolerance (Replication) Module ''
|
|
|
|
|-
|
|
|
|
| GIT Address
|
|
|
|
| [https://github.com/universAAL/middleware/tree/master/middleware.core/mw.reliability.redunduncy HW Redundancy(TMR)], [https://github.com/universAAL/middleware/tree/master/middleware.core/mw.reliability.EventDuplication Event Duplication]
|
|
|
|
|-
|
|
|
|
| Javadoc
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
| Design Diagrams
|
|
|
|
| [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/EventDuplication.png Event Duplication], [http://forge.universaal.org/wiki/https://raw.githubusercontent.com/wiki/universAAL/middleware/Redundancy.png HW Redundancy(TMR)]
|
|
|
|
|-
|
|
|
|
| Reference Documentation
|
|
|
|
|
|
|
|
|
|-
|
|
|
|
|}
|
|
|
|
|
|
|
|
=== Requirements ===
|
|
|
|
|
|
|
|
* '''RC9_TR2''' ''Design for Testability.''
|
|
|
|
* '''RC9_TR3''' ''Correctness-by-Construction.''
|
|
|
|
* '''RC9_TR5''' ''Communication Resource Guarantees.''
|
|
|
|
* '''RC9_TR12''' ''Tolerance of Software Errors.''
|
|
|
|
* '''RC9_TR23''' ''Different Levels of Reliability.''
|
|
|
|
* '''RC9_TR24''' ''Handling of Changing Reliability.''
|
|
|
|
* '''RC9_TR25''' ''Replication.''
|
|
|
|
* '''RC9_TR26''' ''Replica Determinism.''
|
|
|
|
|
|
|
|
=== Features ===
|
|
|
|
Provides replication and voting mechanisms for uAAL applications and for messages transmitted in the uAAL space, for the purpose of error detection and error masking.
|
|
|
|
|
|
|
|
=== Design Decisions ===
|
|
|
|
|
|
|
|
To further enhance the fault tolerance of the components of the universAAL communication platform, nodes and messages are replicated for error detection, and voting mechanisms based on Triple Modular Redundancy (TMR) are used for error masking. TMR provides fault tolerance against component failures: it can cover and hide faulty nodes, and it can overcome faults that arise between Fault Containment Regions. For example, it handles a single point of failure, that is, a part of the system whose failure would otherwise stop the entire system from working. Detected errors are also hidden by Event Duplication Redundancy. Together, these mechanisms handle operational faults of the communication system, value faults, and transient and temporal faults (e.g., late timing faults).
|
|
|
|
|
|
|
|
=== Implementation ===
|
|
|
|
==== Initial implementation from selected input projects ====
|
|
|
|
There was no initial implementation from the input projects.
|
|
|
|
|
|
|
|
==== Implementation Plan ====
|
|
|
|
|
|
|
|
*'''Redundancy with Event Duplication''': For event duplication, two new interfaces are added to the nodes. The first is responsible for duplicating each event that leaves the node: a new event duplication Publisher, which inherits from the initial platform publisher, takes over the event, creates the replicas, and sends them to the context bus. The second, and the most important part of event duplication, is the duplication Subscriber. Its duplication voter uses the predefined Result class to determine the status of the received events: within a predefined timeout it checks the contents of the events for transient and operational faults, and it also checks the temporal behavior of the received events.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/EventDuplication.png|500px|center]]
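The check performed by the duplication voter can be sketched as follows. This is an illustrative sketch only, not the module's code: the Result enum shown here, the DuplicationVoter class, its vote() signature and the use of plain strings as event contents are all assumptions made for the example. The real Result class may carry different states.

```java
import java.util.Objects;

/** Hypothetical outcome type; the module's own Result class may differ. */
enum Result { ACCEPTED, VALUE_FAULT, TIMING_FAULT }

/** Sketch of a voter that compares the two copies of one duplicated event. */
final class DuplicationVoter {
    private final long timeoutMillis;

    DuplicationVoter(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * @param first         content of the first copy
     * @param firstArrival  arrival time of the first copy in milliseconds
     * @param second        content of the second copy, or null if it never arrived
     * @param secondArrival arrival time of the second copy in milliseconds
     */
    Result vote(String first, long firstArrival, String second, long secondArrival) {
        // Temporal check: the second copy must arrive within the timeout window.
        if (second == null || Math.abs(secondArrival - firstArrival) > timeoutMillis) {
            return Result.TIMING_FAULT;
        }
        // Value check: both copies must carry the same content.
        if (!Objects.equals(first, second)) {
            return Result.VALUE_FAULT;
        }
        return Result.ACCEPTED;
    }
}
```

With only two copies, a mismatch can be detected but not corrected, because the voter cannot tell which copy is the faulty one.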
|
|
|
|
|
|
|
|
*'''Hardware redundancy (Triple Modular Redundancy)''': As described earlier in this section, Triple Modular Redundancy is performed at the hardware level. Three replicas of the same node, all running identical functions, run the TMR_Publisher and send the same copy of each event to the context bus. The TMR voter collects the replicated messages and decides on the accuracy of the events and on the desired operation of the redundant nodes. In the implemented TMR, the so-called RedSubscriber acts as the replica voter. The voter logic is simple: it monitors the replicas to determine how to reconfigure the components’ outputs so that the system continues to operate without violating the operational and functional limits of the overall system. In other words, the voter establishes a majority choice among the available replicas as soon as at least two of them are identical; when there is disagreement, it drops the faulty choices, so that a single fault does not interrupt the operation of the whole system. The TMR also runs a continuous timeout counter for the replicas and the voter in order to detect temporal violations of the TMR itself, so that it can stop the voter at any time if no decision is reached or unexplained delays occur.
|
|
|
|
|
|
|
|
[[https://raw.githubusercontent.com/wiki/universAAL/middleware/Redundancy.png|500px|center]]
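The majority decision of the TMR voter can be sketched as below. This is an illustrative sketch and not the RedSubscriber implementation: the TmrVoter class, its vote() method and the use of strings as replica outputs are hypothetical, and a replica that missed its deadline is modelled simply as a null entry.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch of a majority voter over replica outputs (hypothetical names). */
final class TmrVoter {
    private TmrVoter() {}

    /**
     * Returns the value reported by at least two replicas, if there is one.
     * A null entry stands for a replica that delivered nothing in time.
     */
    static Optional<String> vote(List<String> replicaOutputs) {
        Map<String, Integer> counts = new HashMap<>();
        for (String output : replicaOutputs) {
            if (output == null) {
                continue; // a missing replica takes no part in the vote
            }
            int seen = counts.merge(output, 1, Integer::sum);
            if (seen >= 2) {
                return Optional.of(output); // two identical replicas form a majority
            }
        }
        return Optional.empty(); // no two replicas agree: no decision
    }
}
```

An empty result corresponds to the "no decision" case, in which the surrounding timeout logic would stop the voter.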
|
|
|
|
|
|
|
|
== References ==
|
|
|
|
<references/> |