TO DO Lists: Difference between revisions
== Sergey Boyarinov's TODO list ==
- Oct 24, 2023: test setups work (SAMPA, VMM, URWELL)
- Oct 26, 2023: EBDAQ^ crashed with message 'et_events_new returns ERROR = -8'; there was an ETDAQ6 scan from 129.57.71.79, but it is unclear whether it happened at the same time;
TODO: use allow list (our subnets) instead of deny from .71. in dac.s/coda_component.c and dac.s/tcpServer.c, and print addresses of successful connections;
add dates whenever error message is printed;
=== OLD STUFF ===
- 29-mar-2010: send CLAS12 required CAEN HV list to Fernando for joint req - within a week
- 14-aug-2009: buy new monitors for counting house
- event_monitor sometimes does not create correct histogram file at the end of run
- in some configurations where L1 is not used, message 'L1 does not agree..' pops up from run sheet; need to resolve
- check routing table on clon10 and clon00, 'ssh bethsw1' does not work (see Brent's mail)
- need to fix runcontrol graphic problem; also will try to install NVIDIA driver on clon03, switch back clon03:0.0 and clon03:0.1, and set 1600x1200 resolution for clon01:0.1
- got following trying to ssh:
[boiarino@claspc7 ~]$ ssh clasrun@clon04
do_ypcall: clnt_call: RPC: Timed out
clasrun@clon04's password:
long delay was observed between first and second line. Need to understand.
- make internal links on clonweb wiki visible (no 'clonweb' in links, but what instead ?)
- add /usr/clas/archive automount to the Linux procedures (see clon04)
- found a solution for ''et_2_et_10_00'' etc problem: we are running 3 et_2_et's on clon10, question is how our monitoring will distinguish them ? ''ps'' command cuts parameters, and process name ''et_2_et'' is the same for all of them; temporary et_2_et included into 2 others (see ''$CODA/src/et/main'').
- look at UEP15 NIM power supply from WIENER (plug into old NIM bin)
- decide on electronics inventory database and form of presentation - with Sergey P.
- CODA: use et_2_et everywhere (check byte swap!)
- respond on property (see email Oct-9-2007)
- there is an incompatibility on Solaris 10 update 2 with new Studio 12 (bosio does not compile for example), need to fix, maybe install Solaris 10 update 3 from scratch ? if so, clon10 and clon00 need extra work installing dns, nis, bootp, realport etc
- modify CMON and CED for new TOF counters (ask Heddle for CED)
- make datafile for DVCS Trigger simulation using BOS file from Rustam
Oct-9-2007: done, Rustam sent me pedestal file, I generated 10000 events ascii
file for Ben, it is on DVCS Trigger wiki page, sent mail to Ben; program is
$CODA/src/bosio/main/bosdvcstrig.c
- talk to Hovanes about adding new names 'lac' and 'dc' to the dictionary
generation procedure (Makefile modifications, save/restore procedure etc);
update wiki page accordingly
Oct-5-2007: Hovanes showed me how to do that, we did it in new EPICS, must write it
to wiki and repeat in old EPICS
- buy 2 more A1520 500V CAEN boards on Stepan's request for hodoscope
Oct-8-2007: got quotation for A1520's, v288's and v895's
- arrange AC socket replacement on sy527's
- ''preshower_exit'' did not kill rcServer (probably ''daq_exit'' too) - need check
- old sy527 driver for VME-based caenet board
Sep-2007: need to pass to Nerses what I did;
Oct-1-2007: doing it myself with Nerses's help
Oct-4-2007: driver is finished, last puzzle was switching on floating point in vxWorks;
everything works fine; remaining changes: make status bits consistent with sy1527s,
check if 'input V' field must be filled in; merge with sy1527 driver; tune timing
inside v288.c and sy527.c; check power/enable handling (currently 'dis'=Pw and PrON, 'Ena'=PrOn);
check ID's (where it should be 0, where 4 etc)
Oct-9-2007: finish another round of tuning; add extra checks, particularly in v288Get calls, call v288Reset
in case of transmission problems, fix some sleeps, set lower priority for main thread to let EPICS
processes breathe; still have error messages, but ioc seems stable thanks to v288Reset;
TODO: make sure enable/disable and on/off logic works correctly (probably in v288.c)
- cleanup/replug/label BigIron switch, update MRTG names
- shutting down clasonl1 with command
shutdown -y -i 5 -g 120
got following
Shutdown started. Fri Sep 28 16:20:49 EDT 2007
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:20:49...
The system clasonl1 will be shut down in 2 minutes
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:21:49...
The system clasonl1 will be shut down in 1 minute
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:22:19...
The system clasonl1 will be shut down in 30 seconds
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:22:39...
THE SYSTEM clasonl1 IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged
showmount: clasonl1: RPC: Program not registered
Changing to init state 5 - please wait
clasonl1:/root>
Need to understand.
- v1190/v1290 testing: CBLT does not work, wrong slot number
- runcontrol does not compile on clon10, but compiles on clon03 - new studio12 etc - need to install Sol10 U3 everywhere !; runcontrol hangs in 'go' for TDC_CALIB - working on it
- write notes on standalone DAQ operations (enable signal necessity etc)
- equipment list with DB, labels (with Sergey P.)
- JLAB discriminators: ask Volker to push it
28-sep-2007: sent email to Volker, he replied asking for extra info, sent email with requested info
- need 1881M ADCs, at least few modules
- test sy527 which arrived from repair (with George Jacobs)
28-sep-2007: unit tested, boots fine, serial works; George will test alarm output, Sergey - remote control
28-sep-2007: AC connectors arrived, George will arrange replacement (or me ?); box with connectors in
counting room on the table
- buy labels for both labeling machines
17-sep-2007 requisitions 266507 and 266519 have been submitted
26-sep-2007 labels for panel labeler received
- learn, start and test auto-shutdown software on clons
29-sep-2007 soft installed on 18 machines, partially configured, need to send signal from UPS_CHB1 and make sure
it reaches all machines
- on Nerses's request: install S99caRepeater and S99logServer scripts,
update corresponding procedure for Solaris (ask Paul Letta if necessary)
14-sep-2007: installed to ''/etc/rc3.d'' on clon01, but actual reboot was not tested
- fix colors for clasrun accounts (and others ?) on clons
- order 2 more discriminators for DVCSCAL
- on clon10 msql and rtserver must restart automatically on reboot
- nrpe did not restart on clon06, it was in maintenance state; as clasrun I did
disable and enable - it works now; probably network was down when it was trying to restart ?
- make sure emergency generator is by-passed if in service
26-sep-2007: sent mail to Bob Rice
26-sep-2007: I decided to submit request to enforce better power routing during emergency generator panel repair: normal
power must be supplied during that time
== Sergey Boyarinov's COMPLETED list ==
- replug AC power to emergency generators
17-sep-2007: done
- NTP servers on Solaris, update Solaris post-install page (and Linux on clon04)
24-sep-2007: done
- get TIBCO license:
17-sep-2007 requisition 266520 has been submitted
temporary password received, changed to permanent; access to tibco web site works
25-sep-2007: done
- PCAL test setup (with Sergey P.)
14-sep-2007: everything seems working, except missing delay cables and weird
problem: FASTBUS hung on first event
26-sep-2007: replaced signal distr. card (TDC START did not work)
27-sep-2007: add 1 sec delay in 'Go' transition of fbrol1.c for standalone only, it seems
fixed first-event-hung problem; everything seems Ok
27-sep-2007: done
- make paper 'DVCSCAL trigger system' and pass it to Chris and Ben
25-sep-2005 create new page DVCS Trigger System in clonweb wiki, document is there
25-sep-2007: done
- reply to Motorola about prpmc880
with Sergey P.: we asked Motorola for replacement, they offered PrPMC280-like replacement, Sergey on top of that, I'm done
for now
28-sep-2007: done
- on clonweb wrong apache is starting on reboot (check nagios and mrtg as well)
30-sep-2007: done
- on upcoming run request ran 3 ethernet cables and 1 serial cable to the target cartridge; new name 'ioctstarg' was added to the tsconnect.conf; all connections are tested
1-oct-2007: done
- ET system debugging (with Carl)
ET user library >/usr/local/clas/devel/coda/Darwin_powerpc/lib/libet_user.so< will be used
Oct-3-2007: got new version from Carl, test et_2_et: can connect through multiple ports to multiple ETs, BUT cannot connect between machines on different subnets. Tried DIRECT - does not work, ask Carl to check.
Oct-4-2007: pcal EB and ER are not working with new ET !!! Switch to old one, need check ...
Oct-9-2007: final check, everything looks good; add new option 'direct' to 'et_2_et', will be used for communication between machines which are not sharing any subnets, in that case only one ET system on remote machine can be connected as before; if machines are sharing at least one subnet, there is no limit on the number of ET systems we can communicate with - good enough for CLON cluster
Oct-9-2007: done
- order extended maintenance kit for clonhp2
Oct-9-2007: done
== tNetTask suspended problem debugging ==
TROUBLESHOOTING SIMPLE NETWORK PROBLEMS IN TORNADO 2.0
------------------------------------------------------
If you are new to Tornado 2.0, or you are booting a target for
the first time, or if the target booted successfully, but does
not respond to pings, please read basicNetConfiguration article.
basicNetConfiguration contains instructions on how to configure a
kernel with a target shell and network debugging libraries. It
contains information to avoid errors like:
rpccore backend client RPC: Timed out
Network interface unknown
muxDevLoad failed for device entry
wdbConfig: error configuring WDB communication interface
Error loading file: errno = 0xd0003
If target responded to pings initially, and after running your
socket application (or a library that uses the network such as
the FTP server), applications sputter or hang for a while and
then resume, please read netPerformance article. It gives an
overview of vxWorks network internals and configuration issues
that impact performance.
This article assumes that a target has booted successfully and
responds to pings initially. You are either running demo code
from WindSurf->Demo Code->Networking or your own application.
At some point, the target does not respond to pings anymore, or
network performance is slow. At this point from the target shell,
verify that tNetTask is still running at priority 50.
If i shows tNetTask as suspended, call tt tNetTask.
There might be a reference to a null pointer somewhere, or an
uninitialized interrupt. If this error occurs with a default
configuration, you have not added third party network drivers, or
software, your application is not running, and you are just using
WRS software on a supported BSP, please contact:
support@windriver.com.
If application runs fine, but as you increase the number of
sockets or the packet rate, the application sputters or hangs,
run the show functions described below from the target shell
AFTER the target boots and AFTER the problem occurs.
->netStackSysPoolShow
Network stack system memory pool status
->netStackDataPoolShow
Network stack data memory pool status
->i
which tasks are running, show priorities
->inetstatShow
shows sockets and their receive queues
->ipstatShow
show IP protocol statistics
->tcpstatShow
show TCP protocol statistics
->udpstatShow
show UDP protocol statistics
->icmpstatShow
show ICMP protocol statistics
->mRouteShow
show route table, including masks
->routeShow
show route table without masks
->iosFdShow
show file descriptors
->arpShow
show ARP table contents
->hostShow
show host table content
->ifShow
show interfaces attached to IP. This function will show
END and BSD44 drivers.
->muxShow
show END drivers successfully loaded in the MUX
->endPoolShow("name", 0)
END driver memory pool status of driver "name"
For example: endPoolShow("fei", 0)
endPoolShow is not part of the API. This utility is
found in utilities.c
SYMPTOMS:
A. ENOBUFS
SOCKET APPLICATION SPUTTERS
If the network stack or driver memory pool sizes are too small,
applications will appear to hang and recover within minutes as
resources are freed. Errors like ENOBUFS, or
S_netBufLib_NO_POOL_MEMORY may occur.
The problem is most likely insufficient network stack and
driver buffer pool. Call netStackSysPoolShow and
netStackDataPoolShow. Increase the default allocation if
there are 0 free buffers available. Increase the driver's
buffer pool, and IP configuration parameters as indicated
in netPerformance.
B. TARGET STOPS RESPONDING TO PINGS.
RPC CORE BACKEND TIMEOUT REPORTED IN TARGET SERVER WINDOW
1. APPLICATION/DRIVER BUFFER LOANING PROBLEMS
If socket buffers are not being read, driver buffers will be
depleted. As long as the driver is out of buffers, the driver
will not be able to receive or send packets. Pings will be
unanswered. Target server will be unable to contact the
target.
muxShow displays END drivers only. If the boot device (or the
other parameter if boot device is floppy, ata...) is listed
in muxShow, the driver version used is END.
ifShow displays both END and BSD 4.4 drivers.
endPoolShow displays END driver's buffer pool status. If
there are 0 free buffers, use inetstatShow to determine which
socket has data in the receive queue. This information can
be used to identify which task is not reading its sockets.
See netPerformance for an explanation of network driver
buffer loaning scheme.
2. DRIVER PROBLEMS
If inetstatShow shows no data in the receive queue, the
problem may be due to a faulty driver:
If you are using the END driver version, try the BSD44 version
(if available), and vice-versa. If the BSP supports more than
1 network card type, try a different card. If muxShow does not
show any drivers, BSD44 drivers are configured. BSD44 drivers
are configured if EITHER T101 build method is used, and
INCLUDE_END is undefined in config.h OR in T2 Project
facility:
network components->network devices->End attach interface and
End interface support components are excluded.
See basicNetConfiguration document for information on
configuring drivers. If changing drivers works, report
the problem to support@windriver.com.
Check WindSurf->Resolve to check Problem Lists and Keyword
Search.
If there are no other driver versions you can use, use the
loopback driver. Run both the server and the client code in
the target. Send to IP address 127.0.0.1.
C. TARGET RESPONDS TO PINGS, BUT DATA TRANSFER STOPS
If the target responds to pings, but data transfer stops, call
inetstatShow on the target and the corresponding function
(usually netstat) on the other host. If there is data backed
up in the send side and on the receive side on the other side,
most likely there is a deadlock situation within the
client/server application code. Check whether the deadlock
still occurs when you reduce the message size below 1460 bytes
(for TCP). TCP is a stream of bytes. Loops are required
to read or write, since TCP can return partial reads and
partial writes.
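The read/write loops mentioned above can be sketched as follows. This is a minimal illustration in Python using only the standard socket module, not vxWorks code; the helper names recv_exact and send_all are hypothetical and are not part of the article or of any CODA library.

```python
import socket


def recv_exact(sock, nbytes):
    """Read exactly nbytes from a stream socket, looping over partial reads."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)  # may return fewer bytes than requested
        if not chunk:
            # An empty result means the peer closed the connection.
            raise ConnectionError("connection closed with %d bytes missing" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def send_all(sock, data):
    """Write all of data to a stream socket, looping over partial writes."""
    view = memoryview(data)
    sent = 0
    while sent < len(view):
        sent += sock.send(view[sent:])  # send() reports how many bytes it took
    return sent
```

A receiver that calls recv() once and assumes it got the whole message can end up waiting for bytes the sender believes were already delivered, which looks like the deadlock described in this section.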
D. APPLICATION WORKS BUT NETWORK PERFORMANCE IS SLOW
See netPerformance article.
E. PROTOCOL PROBLEMS
Check the before and after protocol error statistics show
routines for clues as to which error is increasing. Please
read Section 1.4.1 of netPerformance article for an
explanation on UDP full sockets.
F. PRIORITY PROBLEMS
Make sure that tNetTask has a higher priority than application
tasks depending on the network. Please read Section 1.2 of
netPerformance article.
G. HARDWARE OR DRIVER MISCONFIGURATION PROBLEMS
1. Try changing the network card or using a different card.
Examine hub's LED. Verify that the LED shows activity for
the driver.
2. Configure the BSD44 driver version, if available. Call
ifShow to determine if there are collisions, CRC ... The END
driver does not report this information. If the driver
reports excessive collisions make sure that another node in
the network is not in full duplex mode.
3. If there are broadcast storms, verify that ifShow does not
show driver is in PROMISCUOUS mode.
Latest revision as of 14:33, 26 October 2023
Sergey Boyarinov's TODO list
- Oct 24, 2023: test setups work (SAMPA, VMM, URWELL)
- Oct 26, 2023: EBDAQ^ crashed with message 'et_events_new returns ERROR = -8'; there was an ETDAQ6 scan from 129.57.71.79, but it is unclear whether it happened at the same time; TODO: use allow list (our subnets) instead of deny from .71. in dac.s/coda_component.c and dac.s/tcpServer.c, and print addresses of successful connections; add dates whenever error message is printed;
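A minimal sketch of the allow-list idea from the TODO above, in Python rather than the C of coda_component.c and tcpServer.c. The subnet in ALLOWED_SUBNETS is a placeholder assumption (the real list of "our subnets" is not given here), and the function names are hypothetical.

```python
import ipaddress
from datetime import datetime

# Placeholder allow list: the actual subnets are site-specific and assumed here.
ALLOWED_SUBNETS = [ipaddress.ip_network("129.57.68.0/24")]


def is_allowed(addr, allowed=ALLOWED_SUBNETS):
    """Return True if addr belongs to at least one allowed subnet."""
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in allowed)


def log_line(message, now=None):
    """Prefix a message with date and time so errors can be matched to scans."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M:%S")
    return "%s %s" % (stamp, message)


def check_connection(addr):
    """Decide on a connection and produce the line to print in either case."""
    if is_allowed(addr):
        return True, log_line("accepted connection from %s" % addr)
    return False, log_line("rejected connection from %s" % addr)
```

Compared with a deny rule for .71., an allow list also rejects scans coming from any other subnet that was not anticipated, and the timestamped lines make it possible to check whether a scan and a crash coincided.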
OLD STUFF
- 29-mar-2010: send CLAS12 required CAEN HV list to Fernando for joint req - within a week
- 14-aug-2009: buy new monitors for counting house
- event_monitor sometimes does not create correct histogram file at the end of run
- in some configurations where L1 is not used, message 'L1 does not agree..' pops up from run sheet; need to resolve
- check routing table on clon10 and clon00, 'ssh bethsw1' does not work (see Brent's mail)
- need to fix runcontrol graphic problem; also will try to install NVIDIA driver on clon03, switch back clon03:0.0 and clon03:0.1, and set 1600x1200 resolution for clon01:0.1
- got following trying to ssh:
[boiarino@claspc7 ~]$ ssh clasrun@clon04
do_ypcall: clnt_call: RPC: Timed out
clasrun@clon04's password:
long delay was observed between first and second line. Need to understand.
- make internal links on clonweb wiki visible (no 'clonweb' in links, but what instead ?)
- add /usr/clas/archive automount to the Linux procedures (see clon04)
- found a solution for et_2_et_10_00 etc problem: we are running 3 et_2_et's on clon10, question is how our monitoring will distinguish them ? ps command cuts parameters, and process name et_2_et is the same for all of them; temporary et_2_et included into 2 others (see $CODA/src/et/main).
- look at UEP15 NIM power supply from WIENER (plug into old NIM bin)
- decide on electronics inventory database and form of presentation - with Sergey P.
- CODA: use et_2_et everywhere (check byte swap!)
- respond on property (see email Oct-9-2007)
- there is an incompatibility on Solaris 10 update 2 with new Studio 12 (bosio does not compile for example), need to fix, maybe install Solaris 10 update 3 from scratch ? if so, clon10 and clon00 need extra work installing dns, nis, bootp, realport etc
- modify CMON and CED for new TOF counters (ask Heddle for CED)
- make datafile for DVCS Trigger simulation using BOS file from Rustam
Oct-9-2007: done, Rustam sent me pedestal file, I generated 10000 events ascii file for Ben, it is on DVCS Trigger wiki page, sent mail to Ben; program is $CODA/src/bosio/main/bosdvcstrig.c
- talk to Hovanes about adding new names 'lac' and 'dc' to the dictionary generation procedure (Makefile modifications, save/restore procedure etc); update wiki page accordingly
Oct-5-2007: Hovanes showed me how to do that, we did it in new EPICS, must write it to wiki and repeat in old EPICS
- buy 2 more A1520 500V CAEN boards on Stepan's request for hodoscope
Oct-8-2007: got quotation for A1520's, v288's and v895's
- arrange AC socket replacement on sy527's
- preshower_exit did not kill rcServer (probably daq_exit too) - need check
- old sy527 driver for VME-based caenet board
Sep-2007: need to pass to Nerses what I did;
Oct-1-2007: doing it myself with Nerses's help
Oct-4-2007: driver is finished, last puzzle was switching on floating point in vxWorks; everything works fine; remaining changes: make status bits consistent with sy1527s, check if 'input V' field must be filled in; merge with sy1527 driver; tune timing inside v288.c and sy527.c; check power/enable handling (currently 'dis'=Pw and PrON, 'Ena'=PrOn); check ID's (where it should be 0, where 4 etc)
Oct-9-2007: finish another round of tuning; add extra checks, particularly in v288Get calls, call v288Reset in case of transmission problems, fix some sleeps, set lower priority for main thread to let EPICS processes breathe; still have error messages, but ioc seems stable thanks to v288Reset;
TODO: make sure enable/disable and on/off logic works correctly (probably in v288.c)
- cleanup/replug/label BigIron switch, update MRTG names
- shutting down clasonl1 with command
shutdown -y -i 5 -g 120
got following
Shutdown started. Fri Sep 28 16:20:49 EDT 2007
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:20:49...
The system clasonl1 will be shut down in 2 minutes
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:21:49...
The system clasonl1 will be shut down in 1 minute
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:22:19...
The system clasonl1 will be shut down in 30 seconds
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:22:39...
THE SYSTEM clasonl1 IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged
showmount: clasonl1: RPC: Program not registered
Changing to init state 5 - please wait
clasonl1:/root>
Need to understand.
- v1190/v1290 testing: CBLT does not work, wrong slot number
- runcontrol does not compile on clon10, but compiles on clon03 - new studio12 etc - need to install Sol10 U3 everywhere !; runcontrol hangs in 'go' for TDC_CALIB - working on it
- write notes on standalone DAQ operations (enable signal necessity etc)
- equipment list with DB, labels (with Sergey P.)
- JLAB discriminators: ask Volker to push it
28-sep-2007: sent email to Volker, he replied asking for extra info, sent email with requested info
- need 1881M ADCs, at least few modules
- test sy527 which arrived from repair (with George Jacobs)
28-sep-2007: unit tested, boots fine, serial works; George will test alarm output, Sergey - remote control
28-sep-2007: AC connectors arrived, George will arrange replacement (or me ?); box with connectors in counting room on the table
- buy labels for both labeling machines
17-sep-2007 requisitions 266507 and 266519 have been submitted
26-sep-2007 labels for panel labeler received
- learn, start and test auto-shutdown software on clons
29-sep-2007 soft installed on 18 machines, partially configured, need to send signal from UPS_CHB1 and make sure it reaches all machines
- on Nerses's request: install S99caRepeater and S99logServer scripts, update corresponding procedure for Solaris (ask Paul Letta if necessary)
14-sep-2007: installed to /etc/rc3.d on clon01, but actual reboot was not tested
- fix colors for clasrun accounts (and others ?) on clons
- order 2 more discriminators for DVCSCAL
- on clon10 msql and rtserver must restart automatically on reboot
- nrpe did not restart on clon06, it was in maintenance state; as clasrun I did disable and enable - it works now; probably network was down when it was trying to restart ?
- make sure emergency generator is by-passed if in service
26-sep-2007: sent mail to Bob Rice
26-sep-2007: I decided to submit request to enforce better power routing during emergency generator panel repair: normal power must be supplied during that time
Sergey Boyarinov's COMPLETED list
- replug AC power to emergency generators
17-sep-2007: done
- NTP servers on Solaris, update Solaris post-install page (and Linux on clon04)
24-sep-2007: done
- get TIBCO license:
17-sep-2007 requisition 266520 has been submitted
temporary password received, changed to permanent; access to tibco web site works
25-sep-2007: done
- PCAL test setup (with Sergey P.)
14-sep-2007: everything seems working, except missing delay cables and weird problem: FASTBUS hung on first event
26-sep-2007: replaced signal distr. card (TDC START did not work)
27-sep-2007: add 1 sec delay in 'Go' transition of fbrol1.c for standalone only, it seems fixed first-event-hung problem; everything seems Ok
27-sep-2007: done
- make paper 'DVCSCAL trigger system' and pass it to Chris and Ben
25-sep-2005 create new page DVCS Trigger System in clonweb wiki, document is there
25-sep-2007: done
- reply to Motorola about prpmc880
with Sergey P.: we asked Motorola for replacement, they offered PrPMC280-like replacement, Sergey on top of that, I'm done for now
28-sep-2007: done
- on clonweb wrong apache is starting on reboot (check nagios and mrtg as well)
30-sep-2007: done
- on upcoming run request ran 3 ethernet cables and 1 serial cable to the target cartridge; new name 'ioctstarg' was added to the tsconnect.conf; all connections are tested
1-oct-2007: done
- ET system debugging (with Carl)
clonpc7:/etc> ifconfig -a
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet6 fe80::20d:93ff:fe65:ba7e%en0 prefixlen 64 scopeid 0x4
        inet 129.57.68.7 netmask 0xffffff00 broadcast 129.57.68.255
        inet 192.168.2.1 netmask 0xffffff00 broadcast 192.168.2.255
        ether 00:0d:93:65:ba:7e
        media: autoselect (100baseTX <full-duplex>) status: active
        supported media: none autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP <full-duplex,hw-loopback> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback>
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        ether 00:11:24:a1:d0:47
        media: autoselect (<unknown type>) status: inactive
        supported media: autoselect
fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 2030
        lladdr 00:0d:93:ff:fe:65:ba:7e
        media: autoselect <full-duplex> status: inactive
        supported media: autoselect <full-duplex>
ET system was confused:
clonpc7:et> et_start -n 100 -s 10000 -f /tmp/test3
et_netinfo reached
et_netinfo: fully qualified default hostname >clonpc7.jlab.org<
ifi->ifi_flags = 0xffff8049 LOOPBACK INTERFACE INTERFACE IS UP ifi->ifi_addr = 0x00300300 hptr = 0x00300190 addr_in->sin_addr = 127.0.0.1,
ifi->ifi_flags = 0xffff8863 INTERFACE IS UP ifi->ifi_addr = 0x00300350 hptr = 0x00300190 addr_in->sin_addr = 129.57.68.7,
ifi->ifi_flags = 0xffff8863 INTERFACE IS UP ifi->ifi_addr = 0x003003a0 hptr = 0x00000000 addr_in->sin_addr = 192.168.2.1,
et_netinfo: address = 129.57.68.7
et_netinfo: error in gethostbyaddr
we've got 192.168.2.1, do not believe it is true
et_netinfo: address = 129.57.68.7
removing file >/tmp/test3<
file >/tmp/test3< removed
et_udpreceive: port=11111
et_udpreceive: port=11112
et_udpreceive: port=11112
et_udpreceive: port=11112
ET user library >/usr/local/clas/devel/coda/Darwin_powerpc/lib/libet_user.so< will be used
To eliminate the alias 192.168.2.1, the following command was used:
ifconfig en0 -alias 192.168.2.1
Now it looks better:
clonpc7:/etc> et_start -n 100 -s 10000 -f /tmp/test3
et_netinfo reached
et_netinfo: fully qualified default hostname >clonpc7.jlab.org<
ifi->ifi_flags = 0xffff8049 LOOPBACK INTERFACE INTERFACE IS UP ifi->ifi_addr = 0x00300300 hptr = 0x00300190 addr_in->sin_addr = 127.0.0.1,
ifi->ifi_flags = 0xffff8863 INTERFACE IS UP ifi->ifi_addr = 0x00300350 hptr = 0x00300190 addr_in->sin_addr = 129.57.68.7,
et_netinfo: address = 129.57.68.7
removing file >/tmp/test3<
file >/tmp/test3< removed
et_udpreceive: port=11111
et_udpreceive: port=11112
ET user library >/usr/local/clas/devel/coda/Darwin_powerpc/lib/libet_user.so< will be used
Oct-3-2007: got new version from Carl, test et_2_et: can connect through multiple ports to multiple ETs, BUT cannot connect between machines on different subnets. Tried DIRECT - does not work, ask Carl to check.
Oct-4-2007: pcal EB and ER are not working with new ET !!! Switch to old one, need check ...
Oct-9-2007: final check, everything looks good; add new option 'direct' to 'et_2_et', will be used for communication between machines which are not sharing any subnets, in that case only one ET system on remote machine can be connected as before; if machines are sharing at least one subnet, there is no limit on the number of ET systems we can communicate with - good enough for CLON cluster
Oct-9-2007: done
- order extended maintenance kit for clonhp2
Oct-9-2007: done
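The rule from the Oct-9-2007 et_2_et note (use 'direct' only when two machines share no subnet) can be illustrated with a short check. This is a sketch in Python under assumptions: the function names are hypothetical, the interface lists are typed in by hand in address/prefix form, and it does not reflect how et_2_et itself decides.

```python
import ipaddress


def shared_subnets(local_ifaces, remote_ifaces):
    """Return the set of networks that appear on both hosts.

    Each interface is given as 'address/prefix', e.g. '129.57.68.7/24'.
    """
    local = {ipaddress.ip_interface(i).network for i in local_ifaces}
    remote = {ipaddress.ip_interface(i).network for i in remote_ifaces}
    return local & remote


def needs_direct(local_ifaces, remote_ifaces):
    """True when the hosts share no subnet, the case the 'direct' option covers."""
    return not shared_subnets(local_ifaces, remote_ifaces)
```

For example, clonpc7 with 129.57.68.7/24 and a hypothetical host at 129.57.71.79/24 share no subnet, so needs_direct returns True for that pair.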
tNetTask suspended problem debugging
TROUBLESHOOTING SIMPLE NETWORK PROBLEMS IN TORNADO 2.0
------------------------------------------------------
If you are new to Tornado 2.0, or you are booting a target for the first time, or if the target booted successfully, but does not respond to pings, please read basicNetConfiguration article. basicNetConfiguration contains instructions on how to configure a kernel with a target shell and network debugging libraries. It contains information to avoid errors like:
rpccore backend client RPC: Timed out
Network interface unknown
muxDevLoad failed for device entry
wdbConfig: error configuring WDB communication interface
Error loading file: errno = 0xd0003
If the target responded to pings initially, and after running your socket application (or a library that uses the network, such as the FTP server) applications sputter or hang for a while and then resume, please read the netPerformance article. It gives an overview of vxWorks network internals and configuration issues that impact performance.
This article assumes that a target has booted successfully and responds to pings initially. You are either running demo code from WindSurf->Demo Code->Networking or your own application.
At some point, the target does not respond to pings anymore, or network performance is slow. At this point from the target shell, verify that tNetTask is still running at priority 50.
If i shows tNetTask as suspended, call tt tNetTask.
There might be a reference to a null pointer somewhere, or an uninitialized interrupt. If this error occurs with a default configuration (you have not added third-party network drivers or software, your application is not running, and you are just using WRS software on a supported BSP), please contact:
support@windriver.com.
If the application runs fine, but sputters or hangs as you increase the number of sockets or the packet rate, run the show functions described below from the target shell AFTER the target boots and again AFTER the problem occurs.
->netStackSysPoolShow
Network stack system memory pool status
->netStackDataPoolShow
Network stack data memory pool status
->i
which tasks are running, show priorities
->inetstatShow
shows sockets and their receive queues
->ipstatShow
show IP protocol statistics
->tcpstatShow
show TCP protocol statistics
->udpstatShow
show UDP protocol statistics
->icmpstatShow
show ICMP protocol statistics
->mRouteShow
show route table, including masks
->routeShow
show route table without masks
->iosFdShow
show file descriptors
->arpShow
show ARP table contents
->hostShow
show host table content
->ifShow
show interfaces attached to IP. This function will show END and BSD44 drivers.
->muxShow
show END drivers successfully loaded in the MUX
->endPoolShow("name", 0)
END driver memory pool status of driver "name". For example: endPoolShow("fei", 0)
endPoolShow is not part of the API. This utility is found in utilities.c
SYMPTOMS:
A. ENOBUFS
SOCKET APPLICATION SPUTTERS
If the network stack or driver memory pool sizes are too small, applications will appear to hang and recover within minutes as resources are freed. Errors like ENOBUFS, or S_netBufLib_NO_POOL_MEMORY may occur.
The problem is most likely insufficient network stack and driver buffer pools. Call netStackSysPoolShow and netStackDataPoolShow. Increase the default allocation if there are 0 free buffers available. Increase the driver's buffer pool and the IP configuration parameters as indicated in netPerformance.
B. TARGET STOPS RESPONDING TO PINGS.
RPC CORE BACKEND TIMEOUT REPORTED IN TARGET SERVER WINDOW
1. APPLICATION/DRIVER BUFFER LOANING PROBLEMS
If socket buffers are not being read, driver buffers will be depleted. As long as the driver is out of buffers, it will not be able to receive or send packets. Pings will go unanswered. The target server will be unable to contact the target.
muxShow displays END drivers only. If the boot device (or the other parameter if boot device is floppy, ata...) is listed in muxShow, the driver version used is END.
ifShow displays both END and BSD 4.4 drivers.
endPoolShow displays END driver's buffer pool status. If there are 0 free buffers, use inetstatShow to determine which socket has data in the receive queue. This information can be used to identify which task is not reading its sockets. See netPerformance for an explanation of network driver buffer loaning scheme.
2. DRIVER PROBLEMS
If inetstatShow shows no data in the receive queue, the problem may be due to a faulty driver:
If you are using the END driver version, try the BSD44 version (if available), and vice versa. If the BSP supports more than one network card type, try a different card. If muxShow does not show any drivers, BSD44 drivers are configured. BSD44 drivers are configured if EITHER the T101 build method is used and INCLUDE_END is undefined in config.h, OR in the T2 Project facility:
network components->network devices->End attach interface and End interface support components are excluded.
See basicNetConfiguration document for information on configuring drivers. If changing drivers works, report the problem to support@windriver.com.
Check WindSurf->Resolve to check Problem Lists and Keyword Search.
If there are no other driver versions you can use, use the loopback driver. Run both the server and the client code in the target. Send to IP address 127.0.0.1.
C. TARGET RESPONDS TO PINGS, BUT DATA TRANSFER STOPS
If the target responds to pings, but data transfer stops, call inetstatShow on the target and the corresponding function (usually netstat) on the other host. If data is backed up on the send side of one host and on the receive side of the other, most likely there is a deadlock within the client/server application code. For TCP, check whether the deadlock still occurs when you reduce the message size below 1460 bytes. TCP is a stream of bytes: loops are required to read or write, since TCP can return partial reads and partial writes.
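The read/write loops mentioned above can be sketched as follows. This is a minimal illustration using the POSIX read() and write() calls on a connected descriptor; the names read_full and write_full are hypothetical, and real application code may need timeouts or non-blocking handling that is not shown here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes unless the peer closes first.
   Returns the number of bytes actually read (less than len only at
   end of stream), or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t got = 0;

    assert(buf != NULL);
    while (got < len) {
        ssize_t n = read(fd, p + got, len - got);
        if (n == 0)
            break;              /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;
        }
        got += (size_t)n;       /* partial read: keep going */
    }
    return (ssize_t)got;
}

/* Hypothetical helper: write all len bytes, looping over partial writes.
   Returns len on success, or -1 on error. */
ssize_t write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t sent = 0;

    assert(buf != NULL);
    while (sent < len) {
        ssize_t n = write(fd, p + sent, len - sent);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal, retry */
            return -1;
        }
        sent += (size_t)n;      /* partial write: keep going */
    }
    return (ssize_t)sent;
}
```

A receiver that expects a fixed-size message would call read_full(sock, msg, sizeof msg) and treat a short count as a closed connection, rather than assuming a single read() returns the whole message.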
D. APPLICATION WORKS BUT NETWORK PERFORMANCE IS SLOW
See netPerformance article.
E. PROTOCOL PROBLEMS
Compare the output of the protocol statistics show routines before and after the problem occurs for clues as to which error count is increasing. Please read Section 1.4.1 of the netPerformance article for an explanation of UDP full sockets.
F. PRIORITY PROBLEMS
Make sure that tNetTask has a higher priority than application tasks depending on the network. Please read Section 1.2 of netPerformance article.
G. HARDWARE OR DRIVER MISCONFIGURATION PROBLEMS
1. Try changing the network card or using a different card. Examine hub's LED. Verify that the LED shows activity for the driver.
2. Configure the BSD44 driver version, if available. Call ifShow to determine if there are collisions, CRC ... The END driver does not report this information. If the driver reports excessive collisions, make sure that another node on the network is not in full duplex mode.
3. If there are broadcast storms, verify that ifShow does not show the driver in PROMISCUOUS mode.
utilities.c:
/*********