This section describes advanced monitoring features, such as deriving information, implementing naming services, and using monitoring to implement a fault-tolerant architecture.
With your standard SmartSockets product, all you have to do is hook into a project, and all monitoring information for all processes is available. However, if you want to use SNMP products to monitor SmartSockets processes, you can purchase our SNMP support module, SmartSockets Monitor. For more information, see your TIBCO sales representative or the TIBCO SmartSockets Monitor User’s Guide.
In many situations, you may wish to derive (calculate) new information from the monitoring information received. The process accessing the monitoring information can calculate whatever derived information it finds useful. For example, the RTmon GDI calculates the change in memory usage of an RTclient since the last poll.
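The derived-value idea above can be sketched in plain C. This is a minimal illustration only: the mem_tracker type and mem_tracker_update function are hypothetical names, not part of the SmartSockets API, and the sketch assumes the memory usage value has already been read out of a poll result message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the previous poll's value (not a SmartSockets type). */
typedef struct {
    int  has_previous;    /* 0 until the first poll result has been recorded */
    long previous_bytes;  /* memory usage reported by the previous poll */
} mem_tracker;

/*
 * Record a new sample and return the change since the previous poll.
 * The first sample has nothing to compare against, so it yields 0.
 */
long mem_tracker_update(mem_tracker *t, long current_bytes)
{
    long delta = 0;

    assert(t != NULL);
    if (t->has_previous) {
        delta = current_bytes - t->previous_bytes;
    }
    t->previous_bytes = current_bytes;
    t->has_previous = 1;
    return delta;
}
```

A monitoring process would keep one such tracker per RTclient and call mem_tracker_update each time a poll result arrives. A negative return value means memory usage has dropped since the last poll.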
Each RTserver and RTclient has an identification string that is used as a descriptive name for the process when it is being monitored. This string shows up in RTmon and is also used as part of a field in these message types:
In the above message types, the full field has the form "ident: user@node" (such as "RTclient: ssuser@workstation1"). The monitoring identification string is used as ident. This string is retrieved with the function TipcMonGetIdentStr and set with the function TipcMonSetIdentStr, or the Monitor_Ident option. This example shows how to set and access the identification string of a process:
T_STR ident_str;

if (!TipcMonSetIdentStr("Acme Inc. Data Collector")) {
  /* error */
}
if (!TipcMonGetIdentStr(&ident_str)) {
  /* error */
}
TutOut("Monitoring identification string is %s\n", ident_str);
An RTclient that calls TipcMonSetIdentStr after calling TipcSrvCreate will not be identified correctly.
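To make the "ident: user@node" layout concrete, the following is a minimal sketch of how such a field could be assembled from its three parts. The function format_monitor_field is a hypothetical helper written for illustration; it is not a SmartSockets function, and SmartSockets builds this field itself.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build "ident: user@node" into buf.
 * Returns 1 on success, 0 if the result did not fit or formatting failed.
 */
int format_monitor_field(char *buf, size_t size,
                         const char *ident, const char *user,
                         const char *node)
{
    int n;

    assert(buf != NULL && ident != NULL && user != NULL && node != NULL);
    n = snprintf(buf, size, "%s: %s@%s", ident, user, node);
    return n >= 0 && (size_t)n < size;
}
```

Called with "RTclient", "ssuser", and "workstation1", the helper produces the string "RTclient: ssuser@workstation1", matching the form described above.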
SmartSockets monitoring provides an elegant way to implement naming and directory services within a project. Using monitoring combined with the publish-subscribe features of SmartSockets, it is very easy to find items of interest on a network. In general, RTclients are identified by the setting of their Unique_Subject option, and groups of RTclients can be identified by subscribing to a specified subject.
The monitoring functions that provide naming or directory services include:
These functions, combined with other functions in the monitoring API, can be used to locate RTclients based on:
For example, to locate the node on which an RTclient, identified by the unique subject program_x, resides, this example is used:
T_IPC_MT mt;
T_IPC_MSG msg;
T_STR client_name;
T_STR ident;
T_STR node_name;
T_STR user_name;
T_INT4 pid;
T_STR project;

/* send the poll request out to RTserver */
if (!TipcMonClientGeneralPoll("program_x")) {
  /* error */
}

mt = TipcMtLookupByNum(T_MT_MON_CLIENT_GENERAL_POLL_RESULT);
if (mt == NULL) {
  /* error */
}

/* wait up to 10 seconds for the poll result */
msg = TipcSrvMsgSearchType(10.0, mt);
if (msg == NULL) {
  /* error */
}

/* access the fields of interest from the returned message */
if (!TipcMsgSetCurrent(msg, 0)) {
  /* error */
}

/* note that we do not have to access all the fields of the message */
if (!TipcMsgRead(msg,
                 T_IPC_FT_STR, &client_name,
                 T_IPC_FT_STR, &ident,
                 T_IPC_FT_STR, &node_name,
                 T_IPC_FT_STR, &user_name,
                 T_IPC_FT_INT4, &pid,
                 T_IPC_FT_STR, &project,
                 NULL)) {
  /* error */
}

/* Output the information */
TutOut("RTclient name = %s\n", client_name);
TutOut("ident = %s\n", ident);
TutOut("node name = %s\n", node_name);
TutOut("user name = %s\n", user_name);
TutOut("pid = %d\n", pid);
TutOut("project = %s\n", project);
In many mission-critical applications, fault tolerance and reliability are important requirements. The system requires continuous operation, 24 hours a day, 7 days a week, regardless of hardware or software failures. The section Handling Network Failures In Publish Subscribe describes how RTserver and RTclient achieve increased reliability in their communication by transparently checking for problems that occur when connecting, sending messages, or receiving messages.
You can achieve a higher level of reliability in your system through the use of software redundancy. Redundancy involves one or more backup processes for each primary process. For example, if you want to ensure that a specified RTclient continues to run regardless of problems that may occur in the network, you can run one or more backup RTclients as a mirror for each primary RTclient.
To minimize down time, you typically want to run the backup RTclient in a hot backup mode. This means the backup RTclient runs in parallel with the primary RTclient, with the backup typically running on a different computer for increased reliability. When the primary RTclient goes down or fails to respond for a given amount of time, control is transferred to the backup RTclient, and it then takes over as the primary, possibly spawning a new backup for itself.
Another easy way to implement a backup RTclient process is to use the SORTED load balancing mode described in Chapter 3, Publish-Subscribe.
When running a backup RTclient in parallel with the primary RTclient, these requirements should be met so as to maximize continuous operation:
SmartSockets provides a straightforward mechanism to meet these requirements in running a primary RTclient with a hot backup. For example, if you want to ensure that an RTclient runs continuously, two RTclients (both identical programs) could be started on different machines with both subscribing to identical subjects. Both RTclients receive the same messages and make equivalent calculations, keeping their internal states consistent. The backup RTclient can run in silent mode by setting the standard option Server_Msg_Send to FALSE, thus preventing any messages from going out. Correspondingly, the primary RTclient should have its Server_Msg_Send option set to TRUE, ensuring its results are sent out. When the primary RTclient fails, the backup RTclient can have its Server_Msg_Send option set to TRUE, making it the new primary RTclient. Because there is now only a single RTclient running, a backup needs to be restarted and have its state restored to that of the primary.
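The mirroring behavior described above can be modeled without any SmartSockets calls. The following is a minimal sketch under stated assumptions: mirror_client is a hypothetical stand-in for an RTclient, its msg_send_enabled flag plays the role of the Server_Msg_Send option, and "sending" is reduced to counting. It shows only why the two processes stay in step while just one of them publishes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an RTclient (not a SmartSockets type). */
typedef struct {
    int  msg_send_enabled;  /* plays the role of Server_Msg_Send */
    long state;             /* running total computed from incoming messages */
    int  sent_count;        /* number of results actually sent out */
} mirror_client;

/*
 * Process one incoming value. Primary and backup both update their
 * internal state, but only a client with sending enabled publishes.
 */
void mirror_client_process(mirror_client *c, long value)
{
    assert(c != NULL);
    c->state += value;
    if (c->msg_send_enabled) {
        c->sent_count++;
    }
}

/* Promote a silent backup to primary by enabling its outgoing messages. */
void mirror_client_promote(mirror_client *c)
{
    assert(c != NULL);
    c->msg_send_enabled = 1;
}
```

Because the backup has processed every message the primary did, promoting it requires only flipping the send flag. No state has to be transferred at failover time, which is what makes the backup "hot".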
To demonstrate this simple strategy of running redundant RTclients, one a mirror of the other, a complete example is shown and discussed in the following sections. This example uses a user-defined RTclient to monitor and control the primary and backup processes. This strategy is appealing because it is non-intrusive: you achieve software fault tolerance without making any changes to the program used by the primary and backup RTclients.
Suppose that you want to run a hot backup for one RTclient in the project user_manual. The subjects to subscribe to are set in a command file that is read at startup. In this example, the RTclient is only interested in processing messages sent to the chapter5 subject.
This is an example of the startup command file for the primary RTclient. The highlighted text is the text you must add so that the process can be monitored for failure and failover can occur if necessary:
The startup file for the backup RTclient is identical except that it subscribes to the backup_client subject instead of the primary_client subject, and it has its Server_Msg_Send option set to FALSE as shown:
Note that in both command files the programs subscribe to the chapter5 subject. Sending a message to the chapter5 subject ensures that both the primary and backup RTclient receive the same message. The backup_client subject is reserved for messages that need to be sent to the hot backup only.
In the following example program, a user-defined RTclient (guardian.c) is used to perform the task of monitoring a primary and backup program. This program is referred to as the Guardian.
The source code files for this example are located in these directories:
The online source files have additional #ifdefs to provide C++ support; these #ifdefs are not shown to simplify the example.
This is the complete C source code example for the Guardian program:
/* guardian.c -- RTclient example for managing a hot backup process */
/*
USAGE
guardian.x <project name>
Also can use udrecv.c, udsend.c, primary.cm, and backup.cm to test.
This program is an RTclient that provides a primary and
backup RTclient for the project passed in as a command-line
argument. To use this, you must have two duplicate RTclients
with different startup command files.
In the first command file, which is for the primary RTclient, the
following lines should be added (or modified if they already exist):
setopt server_msg_send TRUE
setopt subjects <existing_subjects>, primary_client
In the second command file, which is for the "hot backup", the
following lines should be added (or modified if they already exist):
setopt server_msg_send FALSE
setopt subjects <existing_subjects>, backup_client
You should also have shell scripts to start the primary and backup
RTclients named "startpcl" and "startbcl" respectively.
The guardian program then watches the primary_client and
backup_client and switches the backup over as needed when the primary
fails or restarts the backup if it fails.
*/
#define NOT_STARTED -1

#include <string.h> /* strcmp */
#include <rtworks/ipc.h>

static T_STR project_name;
static T_INT4 num_primary = NOT_STARTED; /* Primary RTclient count */
static T_INT4 num_backup = NOT_STARTED;  /* Backup RTclient count */
/* =============================================================== */
/*..cb_default -- default callback to handle unexpected messages */
static void T_ENTRY cb_default(
  T_IPC_CONN conn,
  T_IPC_CONN_DEFAULT_CB_DATA data,
  T_CB_ARG arg)
{
  T_IPC_MT mt;
  T_STR name;

  /* print out the name of the type of the message */
  if (!TipcMsgGetType(data->msg, &mt)) {
    TutOut("Could not get message type from message: error <%s>.\n",
           TutErrStrGet());
    return;
  }
  if (!TipcMtGetName(mt, &name)) {
    TutOut("Could not get name from message type: error <%s>.\n",
           TutErrStrGet());
    return;
  }
  TutOut("Unexpected message type name is <%s>\n", name);
} /* cb_default */
/* =============================================================== */
/*..cb_subject_status -- callback to process MON_SUBJECT_SUBSCRIBE_STATUS messages */
static void T_ENTRY cb_subject_status(
  T_IPC_CONN conn,
  T_IPC_CONN_PROCESS_CB_DATA data,
  T_CB_ARG arg)
{
  T_IPC_MSG msg = data->msg;
  T_IPC_MT mt;
  T_STR subject_name;
  T_STR lc_subject_name; /* lower-case version of subject_name */
  T_INT4 subject_count;
  T_REAL8 wall_time;
  T_STR time_str;
  T_STR *subscribe_client_names;

  /*
* cb_subject_status is called when a MON_SUBJECT_SUBSCRIBE_STATUS
* subject status message is received. Each time an RTclient starts
* or stops subscribing to the primary_client or backup_client
* subjects this function is called.
*/
/* Get the current wall clock time and convert it to a string */
  wall_time = TutGetWallTime();
  time_str = TutTimeNumToStr(wall_time);

  /* Parse the subject status message for the subject id and */
  /* the number of RTclients subscribed to it. */
  /* Set current field */
  if (!TipcMsgSetCurrent(msg, 0)) {
    TutOut("Could not set current field of message: error <%s>.\n",
           TutErrStrGet());
    return;
  }

  /* Get the fields of interest from the message */
  if (!TipcMsgNextStr(msg, &subject_name)
      || !TipcMsgNextStrArray(msg, &subscribe_client_names,
                              &subject_count)) {
    TutOut("Unable to access MON_SUBJECT_SUBSCRIBE_STATUS ");
    TutOut("message: error <%s>.\n", TutErrStrGet());
    return;
  }
  TutOut("Time = %s\n", time_str);
  TutOut("%d RTclients are subscribing to the %s subject.\n",
         subject_count, subject_name);

  /* make a copy of the subject name and convert to lower case */
  T_STRDUP(lc_subject_name, subject_name);
  TutStrLwr(lc_subject_name, lc_subject_name);
  if (strcmp(lc_subject_name, "primary_client") == 0) {
    num_primary = subject_count;
  }
  else if (strcmp(lc_subject_name, "backup_client") == 0) {
    num_backup = subject_count;
  }
  else {
    TutOut("Received SUBJECT_SUBSCRIBE_STATUS for unwanted ");
    TutOut("subject %s.\n", lc_subject_name);
    T_FREE(lc_subject_name);
    return;
  }

  /* Prepare to send CONTROL messages later on. */
  mt = TipcMtLookupByNum(T_MT_CONTROL);
  if (mt == NULL) {
    TutOut("Could not look up control message: error <%s>.\n",
           TutErrStrGet());
    T_FREE(lc_subject_name); /* do not leak the copy on this error path */
    return;
  }

  /*
* There are four main cases that need to be covered. They are
* listed below followed by the actions to be taken when the case
* is encountered:
* CASE 1: Neither primary nor backup RTclient is running.
* start primary RTclient
* start backup RTclient
* CASE 2: Both primary and backup RTclients are running.
* output message that all is OK
* CASE 3: Primary RTclient fails; backup RTclient is running.
* set server_msg_send option in backup RTclient to TRUE
* Have backup start subscribing to the primary_client subject
* Have backup stop subscribing to the backup_client subject
* (This will then cause CASE 4 to occur);
* CASE 4: Backup RTclient fails, primary RTclient is running.
* Restart backup RTclient
*/
/* Check if neither the primary RTclient nor backup RTclient have */
/* been started yet */
  if (num_primary == 0 && num_backup == 0) {
    TutOut("Neither primary nor backup RTclient yet started.\n");
    TutOut("Starting primary RTclient...\n");
    TutSystem("startpcl &");
    TutOut("Starting backup RTclient...\n");
    TutSystem("startbcl &");
  }
  else if (num_primary == 1 && num_backup == 1) {
    TutOut("Both primary and backup RTclients are running!\n");
  }
  else if (strcmp(lc_subject_name, "primary_client") == 0) {
    if (num_primary == 1 && num_backup <= 0) {
      TutOut("Primary RTclient up and running! Waiting on ");
      TutOut("backup...\n");
    }
    else if (num_primary == 0 && num_backup <= 0) {
      TutOut("No primary RTclient yet; No report yet from backup \n");
    }
    /* Check if we have lost the primary RTclient */
    else if (num_primary == 0 && num_backup == 1) {
      TutOut("Primary RTclient has failed!\n");
      TutOut("Switching the backup RTclient to be primary...\n");
      if (!TipcSrvMsgWrite("backup_client", mt, TRUE,
                           T_IPC_FT_STR, "setopt server_msg_send TRUE",
                           NULL)) {
        TutOut("Could not send setopt control message to ");
        TutOut("backup_client: error <%s>.\n", TutErrStrGet());
      }
      if (!TipcSrvMsgWrite("backup_client", mt, TRUE,
                           T_IPC_FT_STR, "subscribe primary_client",
                           NULL)) {
        TutOut("Could not send subscribe control message to ");
        TutOut("backup_client: error <%s>.\n", TutErrStrGet());
      }
      if (!TipcSrvMsgWrite("backup_client", mt, TRUE,
                           T_IPC_FT_STR, "unsubscribe backup_client",
                           NULL)) {
        TutOut("Could not send unsubscribe control message to ");
        TutOut("backup_client: error <%s>.\n", TutErrStrGet());
      }
    }
    else {
      TutOut("We have an irregular number of RTclients!\n");
      TutOut("Number of primary RTclients: %d\n", num_primary);
      TutOut("Number of backup RTclients: %d\n", num_backup);
    }
  }
  else if (strcmp(lc_subject_name, "backup_client") == 0) {
    if (num_primary <= 0 && num_backup == 1) {
      TutOut("Backup RTclient up and running! Waiting on ");
      TutOut("primary...\n");
    }
    else if (num_primary <= 0 && num_backup == 0) {
      TutOut("No backup RTclient yet; ");
      TutOut("No report received yet from primary RTclient.\n");
    }
    /* Check if we have lost the backup RTclient */
    else if (num_primary == 1 && num_backup == 0) {
      TutOut("Backup RTclient is down!\n");
      TutOut("Starting a new backup RTclient!\n");
      TutSystem("startbcl &");
    }
    else {
      TutOut("We have an irregular number of RTclients!\n");
      TutOut("Number of primary RTclients : %d\n", num_primary);
      TutOut("Number of backup RTclients : %d\n", num_backup);
    }
  }
  TutOut("================================\n");
  T_FREE(lc_subject_name);
} /* cb_subject_status */
/* =============================================================== */
/*..main -- main program */
int main(int argc, char **argv)
{
  T_OPTION option;
  T_IPC_MT mt;

  /* Check the command-line arguments */
  if (argc != 2) {
    TutOut("Usage: guardian <project>\n");
    TutExit(T_EXIT_FAILURE);
  }

  /* Save the pointer to the command line argument */
  project_name = argv[1];
  TutOut("Monitoring project <%s>...\n", project_name);

  /* Set the project name */
  option = TutOptionLookup("project");
  if (option == NULL) {
    TutOut("Could not look up option named project: error <%s>.\n",
           TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }
  if (!TutOptionSetEnum(option, project_name)) {
    TutOut("Could not set the option named <project>: error <%s>.\n",
           TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }

  /* Set the time format for the FULL format */
  TutCommandParseStr("setopt time_format full");

  /* Create a connection to RTserver */
  if (!TipcSrvCreate(T_IPC_SRV_CONN_FULL)) {
    TutOut("Could not connect to RTserver: error <%s>.\n",
           TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }

  /* Create callback to process MON_SUBJECT_SUBSCRIBE_STATUS msgs */
  mt = TipcMtLookupByNum(T_MT_MON_SUBJECT_SUBSCRIBE_STATUS);
  if (mt == NULL) {
    TutOut("Could not look up MON_SUBJECT_SUBSCRIBE_STATUS ");
    TutOut("message type: error <%s>.\n", TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }
  if (TipcSrvProcessCbCreate(mt, cb_subject_status, NULL) == NULL) {
    TutOut("Could not create MON_SUBJECT_SUBSCRIBE_STATUS ");
    TutOut("process callback: error <%s>.\n", TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }

  /* Create default callback to handle unwanted message types */
  if (TipcSrvDefaultCbCreate(cb_default, NULL) == NULL) {
    TutOut("Could not create default callback: error <%s>.\n",
           TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }

  /* Start watching primary_client and backup_client subjects */
  TutOut("Starting to watch <primary_client> subject.\n");
  if (!TipcMonSubjectSubscribeSetWatch("primary_client", TRUE)) {
    TutOut("Could not start watching primary_client subject.\n");
    TutOut(" error <%s>.\n", TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }
  TutOut("Starting to watch <backup_client> subject.\n");
  if (!TipcMonSubjectSubscribeSetWatch("backup_client", TRUE)) {
    TutOut("Could not start watching backup_client subject.\n");
    TutOut(" error <%s>.\n", TutErrStrGet());
    TutExit(T_EXIT_FAILURE);
  }

  /* If RTserver stops, then TipcSrvMainLoop will restart RTserver */
  /* and return FALSE. We can safely continue. */
  for (;;) {
    if (!TipcSrvMainLoop(T_TIMEOUT_FOREVER)) {
      TutOut("TipcSrvMainLoop failed: error <%s>.\n", TutErrStrGet());
    }
  }

  /* This line should never be reached */
  TutOut("This line should never be reached!\n");
  return T_EXIT_FAILURE;
} /* main */
To compile, link, and run the guardian program, first you must either copy the program to your own directory or have write permission in these directories:
In addition to the guardian program, these two other programs are provided so you can test the application:
udrecv.c
This program outputs the field of each COUNT message (an integer) and sends it back to udsend.c. Two instances of this program are run to simulate a primary and backup RTclient. The program takes a single command-line argument, the name of a command file. In the test, these command files are primary.cm and backup.cm.
udsend.c
This program sends a COUNT message (an integer, incremented each time) to the chapter5 subject. It also has a callback to output the field of any COUNT messages sent back to it from udrecv.c. COUNT is a user-defined message type that consists of one field, an integer. The program udsend increments this field each time it sends a message.
Compile and link all three programs: guardian, udrecv, and udsend
$ cc guardian.c
$ rtlink /exec=guardian.exe guardian.obj
$ cc udrecv.c
$ rtlink /exec=udrecv.exe udrecv.obj
$ cc udsend.c
$ rtlink /exec=udsend.exe udsend.obj
Run the stguardn command
To run the test, run the stguardn command, which starts guardian, two instances of udrecv (one primary and one backup), and one udsend.
This is an example of the output:
Monitoring project <user_manual>...
Connecting to project <user_manual> on <_node> RTserver.
Using local protocol.
Message from RTserver: Connection established.
Start subscribing to subject </_workstation1_2252>.
Starting to watch <primary_client> subject.
Starting to watch <backup_client> subject.
Time = Fri Mar 14 12:54:50.567 1997
1 RTclients are subscribing to the primary_client subject.
Primary RTclient up and running! Waiting on backup...
================================
Time = Fri Mar 14 12:54:50.689 1997
1 RTclients are subscribing to the backup_client subject.
Both primary and backup RTclients are running!
================================
Test failover by killing udrecv
Once both the primary and backup udrecv programs are running, along with udsend, test the failover by killing the primary udrecv or backup udrecv program.
If the primary udrecv exits, then Guardian produces output similar to:
Time = Fri Mar 14 12:55:37.929 1997
0 RTclients are subscribing to the primary_client subject.
Primary RTclient has failed!
Switching the backup RTclient to be primary...
================================
Time = Fri Mar 14 12:55:38.138 1997
1 RTclients are subscribing to the primary_client subject.
Both primary and backup RTclients are running!
================================
Time = Fri Mar 14 12:55:38.180 1997
0 RTclients are subscribing to the backup_client subject.
Backup RTclient is down!
Starting a new backup RTclient!
================================
Time = Fri Mar 14 12:55:42.926 1997
1 RTclients are subscribing to the backup_client subject.
Both primary and backup RTclients are running!
================================
Once both the primary and backup udrecv programs are running, if the backup udrecv exits, then Guardian produces output similar to:
Time = Fri Mar 14 12:55:22.772 2002
0 RTclients are subscribing to the backup_client subject.
Backup RTclient is down!
Starting a new backup RTclient!
================================
Time = Fri Mar 14 12:55:27.752 2002
1 RTclients are subscribing to the backup_client subject.
Both primary and backup RTclients are running!
================================
Guardian uses the function TipcMonSubjectSubscribeSetWatch to set up subscription monitoring on the primary_client and backup_client subjects:
if (!TipcMonSubjectSubscribeSetWatch("primary_client", TRUE)) {
  TutOut("Could not start watching primary_client subject.\n");
  TutOut(" error <%s>.\n", TutErrStrGet());
  TutExit(T_EXIT_FAILURE);
}
if (!TipcMonSubjectSubscribeSetWatch("backup_client", TRUE)) {
  TutOut("Could not start watching backup_client subject.\n");
  TutOut(" error <%s>.\n", TutErrStrGet());
  TutExit(T_EXIT_FAILURE);
}
Whenever the subscriber count for the primary_client subject changes, RTserver sends a MON_SUBJECT_SUBSCRIBE_STATUS message to Guardian. An RTclient message process callback function must be created for this message type within Guardian:
mt = TipcMtLookupByNum(T_MT_MON_SUBJECT_SUBSCRIBE_STATUS);
if (mt == NULL) {
  TutOut("Could not look up MON_SUBJECT_SUBSCRIBE_STATUS ");
  TutOut("message type: error <%s>.\n", TutErrStrGet());
  TutExit(T_EXIT_FAILURE);
}
if (TipcSrvProcessCbCreate(mt, cb_subject_status, NULL) == NULL) {
  TutOut("Could not create MON_SUBJECT_SUBSCRIBE_STATUS ");
  TutOut("process callback: error <%s>.\n", TutErrStrGet());
  TutExit(T_EXIT_FAILURE);
}
Whenever an RTclient starts or stops subscribing to either the primary_client or backup_client subject, a MON_SUBJECT_SUBSCRIBE_STATUS message is sent by RTserver to Guardian, causing the callback function cb_subject_status to be executed. The function cb_subject_status handles the failover and the starting of the RTclients.
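There are four main cases that cb_subject_status must be able to handle, matching the cases listed in the comment in the source code:
Case 1: Neither the primary nor the backup RTclient is running. Guardian starts the primary RTclient and then the backup RTclient.
Case 2: Both the primary and backup RTclients are running. Guardian outputs a message that all is OK.
Case 3: The primary RTclient fails while the backup RTclient is running. Guardian first sends a CONTROL message to the backup RTclient to set its server_msg_send option to TRUE:
if (!TipcSrvMsgWrite("backup_client", mt, TRUE,
                     T_IPC_FT_STR, "setopt server_msg_send TRUE",
                     NULL)) {
  TutOut("Could not send setopt control message to ");
  TutOut("backup_client: error <%s>.\n", TutErrStrGet());
}
Guardian then sends a CONTROL message to have the backup RTclient start subscribing to the primary_client subject. In essence, this makes the backup now the primary RTclient:
if (!TipcSrvMsgWrite("backup_client", mt, TRUE,
                     T_IPC_FT_STR, "subscribe primary_client",
                     NULL)) {
  TutOut("Could not send subscribe control message to ");
  TutOut("backup_client: error <%s>.\n", TutErrStrGet());
}
At this point, the RTclient is subscribing to both the primary_client and backup_client subjects. To complete the hot failover, a CONTROL message is sent to the backup RTclient to stop subscribing to the backup_client subject. This completes the transition of the backup to the primary.
When the backup_client subject is no longer being subscribed to, a new MON_SUBJECT_SUBSCRIBE_STATUS message is issued by RTserver, specifying that the process count for the backup_client subject is now zero, thus causing case 4 (described below) to occur.
Case 4: The backup RTclient fails while the primary RTclient is running. Guardian restarts the backup RTclient:
TutSystem("startbcl &")
The ampersand (&) is critical in the call to TutSystem as it starts the process in the background (in OpenVMS this is called spawning the process) and returns immediately, without waiting for it to complete. Failure to specify the ampersand (&) causes the Guardian process to hang.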
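The case analysis can be condensed into a single decision function. The sketch below is a simplification written for illustration, not code from guardian.c: the guardian_action enum and guardian_decide function are hypothetical names, and unlike cb_subject_status the sketch ignores which subject the status message was about and looks only at the two subscriber counts.

```c
#include <assert.h>

#define NOT_STARTED (-1)  /* same sentinel as guardian.c: no report received yet */

/* Hypothetical action codes, one per case handled by the Guardian. */
typedef enum {
    ACTION_START_BOTH,      /* case 1: start primary and backup */
    ACTION_NONE,            /* case 2: all is OK */
    ACTION_PROMOTE_BACKUP,  /* case 3: switch the backup to primary */
    ACTION_RESTART_BACKUP,  /* case 4: start a new backup */
    ACTION_WAIT             /* counts not yet reported, or irregular */
} guardian_action;

/* Map the current subscriber counts to the action the Guardian takes. */
guardian_action guardian_decide(int num_primary, int num_backup)
{
    if (num_primary == 0 && num_backup == 0) {
        return ACTION_START_BOTH;
    }
    if (num_primary == 1 && num_backup == 1) {
        return ACTION_NONE;
    }
    if (num_primary == 0 && num_backup == 1) {
        return ACTION_PROMOTE_BACKUP;
    }
    if (num_primary == 1 && num_backup == 0) {
        return ACTION_RESTART_BACKUP;
    }
    return ACTION_WAIT;
}
```

Separating the decision from the side effects (TutSystem and TipcSrvMsgWrite calls) in this way makes the failover logic easy to check in isolation. Any count still equal to NOT_STARTED, or any count greater than one, falls through to ACTION_WAIT.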
A key question is whether one can lose any outgoing messages from using this approach. The only time this happens is from the time RTserver detects the loss of the primary RTclient to the time that Guardian is able to switch the backup RTclient to be primary. This time period in most cases should be less than one second. If the primary RTclient fails at the same time it is sending out messages, there is potential loss of data. This risk can be further reduced by using guaranteed message delivery. See Chapter 4, Guaranteed Message Delivery, for more details.
TIBCO SmartSockets™ User’s Guide Software Release 6.8, July 2006 Copyright © TIBCO Software Inc. All rights reserved www.tibco.com