TO DO Lists

From CLONWiki
Boiarino (talk | contribs)
Latest revision as of 14:33, 26 October 2023

Sergey Boyarinov's TODO list

- Oct 24, 2023: test setups work (SAMPA, VMM, URWELL)

- Oct 26, 2023: EBDAQ^ crashed with message 'et_events_new returns ERROR = -8'; there was an ETDAQ6 scan from 129.57.71.79, but it is unclear whether it happened at the same time; TODO: use an allow list (our subnets) instead of 'deny from .71.' in dac.s/coda_component.c and dac.s/tcpServer.c, and print addresses of successful connections; add dates whenever an error message is printed;
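
A minimal sketch of the allow-list idea, not the actual code in coda_component.c or tcpServer.c. The names ip_in_subnet, peer_allowed, log_dated and the subnet table are hypothetical; the single subnet listed is only an example and the real list of our subnets would have to be filled in. It assumes IPv4 and POSIX inet_pton.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical allow-list entry: network address and prefix length. */
typedef struct {
    const char *net;
    int prefix;
} subnet_t;

/* Example entry only; replace with the real DAQ subnets. */
static const subnet_t allowed[] = {
    { "129.57.68.0", 24 },
};

/* Return 1 if dotted-quad ip lies inside net/prefix, 0 otherwise
   (including when either string cannot be parsed). */
int ip_in_subnet(const char *ip, const char *net, int prefix)
{
    struct in_addr a, n;
    uint32_t mask;

    assert(ip != NULL && net != NULL);
    if (prefix < 0 || prefix > 32) return 0;
    if (inet_pton(AF_INET, ip, &a) != 1) return 0;
    if (inet_pton(AF_INET, net, &n) != 1) return 0;
    /* Shifting a 32-bit value by 32 is undefined, so handle /0 separately. */
    mask = (prefix == 0) ? 0u : (0xffffffffu << (32 - prefix));
    return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}

/* Return 1 if the peer address matches any allow-list entry. */
int peer_allowed(const char *ip)
{
    size_t i;
    for (i = 0; i < sizeof(allowed) / sizeof(allowed[0]); i++) {
        if (ip_in_subnet(ip, allowed[i].net, allowed[i].prefix)) return 1;
    }
    return 0;
}

/* Print a message about a peer, prefixed with local date and time. */
void log_dated(FILE *out, const char *msg, const char *ip)
{
    char stamp[32];
    time_t now = time(NULL);
    struct tm *tmv = localtime(&now);

    if (tmv == NULL ||
        strftime(stamp, sizeof stamp, "%Y-%m-%d %H:%M:%S", tmv) == 0) {
        snprintf(stamp, sizeof stamp, "unknown time");
    }
    fprintf(out, "[%s] %s %s\n", stamp, msg, ip);
}
```

In the accept loop the peer address would be converted with inet_ntop, checked with peer_allowed, and both accepted and rejected connections reported through log_dated, so that a crash can later be matched in time against a scan.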



OLD STUFF

- 29-mar-2010: send CLAS12 required CAEN HV list to Fernando for joint req - within a week

- 14-aug-2009: buy new monitors to counting house

- event_monitor sometimes does not create correct histogram file at the end of run

- in some configurations where L1 is not used, message 'L1 does not agree..' pops up from run sheet; need to resolve

- check routing table on clon10 and clon00, 'ssh bethsw1' does not work (see Brent's mail)

- need to fix runcontrol graphic problem; also will try to install NVIDIA driver on clon03, switch back clon03:0.0 and clon03:0.1, and set 1600x1200 resolution for clon01:0.1

- got the following when trying to ssh:

[boiarino@claspc7 ~]$ ssh clasrun@clon04
do_ypcall: clnt_call: RPC: Timed out
clasrun@clon04's password:

long delay was observed between first and second line. Need to understand.

- make internal links on clonweb wiki visible (no 'clonweb' in links, but what instead ?)

- add /usr/clas/archive automount to the Linux procedures (see clon04)

- found a solution for et_2_et_10_00 etc problem: we are running 3 et_2_et's on clon10, the question is how our monitoring will distinguish them: the ps command cuts parameters, and the process name et_2_et is the same for all of them; temporarily et_2_et is included into 2 others (see $CODA/src/et/main).

- look at UEP15 NIM power supply from WIENER (plug into old NIM bin)

- decide on electronics inventory database and form of presentation - with Sergey P.

- CODA: use et_2_et everywhere (check byte swap!)

- respond on property (see email Oct-9-2007)

- there is an incompatibility on Solaris 10 update 2 with new Studio 12 (bosio does not compile, for example), need to fix, maybe install Solaris 10 update 3 from scratch ? if so, clon10 and clon00 need extra work installing dns, nis, bootp, realport etc

- modify CMON and CED for new TOF counters (ask Heddle for CED)

- make datafile for DVCS Trigger simulation using BOS file from Rustam

Oct-9-2007: done, Rustam sent me pedestal file, I generated 10000 events ascii
  file for Ben, it is on DVCS Trigger wiki page, sent mail to Ben; program is
  $CODA/src/bosio/main/bosdvcstrig.c

- talk to Hovanes about adding new names 'lac' and 'dc' to the dictionary generation procedure (Makefile modifications, save/restore procedure etc); update wiki page accordingly

Oct-5-2007: Hovanes showed me how to do that, we did it in new EPICS, must write it
  to wiki and repeat in old EPICS

- buy 2 more A1520 500V CAEN boards on Stepan's request for hodoscope

Oct-8-2007: got quotation for A1520's, v288's and v895's

- arrange AC socket replacement on sy527's

- preshower_exit did not kill rcServer (probably daq_exit too) - need check

- old sy527 driver for VME-based caenet board

Sep-2007: need to pass to Nerses what I did;
Oct-1-2007: doing it myself with Nerses's help
Oct-4-2007: driver is finished, last puzzle was switching on floating point in vxWorks;
   everything works fine; remaining changes: make status bits consistent with sy1527s,
   check if 'input V' field must be filled in; merge with sy1527 driver; tune timing
   inside v288.c and sy527.c; check power/enable handling (currently 'dis'=Pw and PrON, 'Ena'=PrOn);
   check ID's (where it should be 0, where 4 etc)
Oct-9-2007: finished another round of tuning; added extra checks, particularly in v288Get calls, call v288Reset
   in case of transmission problems, fixed some sleeps, set lower priority for main thread to let EPICS
   processes breathe; still have error messages, but ioc seems stable thanks to v288Reset;
TODO: make sure enable/disable and on/off logic works correctly (probably in v288.c)

- cleanup/replug/label BigIron switch, update MRTG names

- shutting down clasonl1 with command

shutdown -y -i 5 -g 120

got following

Shutdown started.    Fri Sep 28 16:20:49 EDT 2007
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:20:49...
The system clasonl1 will be shut down in 2 minutes
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:21:49...
The system clasonl1 will be shut down in 1 minute
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:22:19...
The system clasonl1 will be shut down in 30 seconds
showmount: clasonl1: RPC: Program not registered
Broadcast Message from root (pts/28) on clasonl1 Fri Sep 28 16:22:39...
THE SYSTEM clasonl1 IS BEING SHUT DOWN NOW ! ! !
Log off now or risk your files being damaged
showmount: clasonl1: RPC: Program not registered
Changing to init state 5 - please wait
clasonl1:/root>

Need to understand.

- v1190/v1290 testing: CBLT does not work, wrong slot number

- runcontrol does not compile on clon10, but compiles on clon03 - new studio12 etc - need to install Sol10 U3 everywhere !; runcontrol hangs in 'go' for TDC_CALIB - working on it

- write notes on standalone DAQ operations (enable signal necessity etc)

- equipment list with DB, labels (with Sergey P.)

- JLAB discriminators: ask Volker to push it

28-sep-2007: sent email to Volker, he replied asking for extra info, sent email with requested info

- need 1881M ADCs, at least few modules

- test sy527 which arrived from repair (with George Jacobs)

28-sep-2007: unit tested, boots fine, serial works; George will test alarm output, Sergey - remote control
28-sep-2007: AC connectors arrived, George will arrange replacement (or me ?); box with connectors in
  counting room on the table

- buy labels for both labeling machines

17-sep-2007 requisitions 266507 and 266519 have been submitted
26-sep-2007 labels for panel labeler received

- learn, start and test auto-shutdown software on clons

29-sep-2007 soft installed on 18 machines, partially configured, need to send signal from UPS_CHB1 and make sure
it reaches all machines

- on Nerses's request: install S99caRepeater and S99logServer scripts, update corresponding procedure for Solaris (ask Paul Letta if necessary)

14-sep-2007: installed to /etc/rc3.d on clon01, but actual reboot was not tested

- fix colors for clasrun accounts (and others ?) on clons

- order 2 more discriminators for DVCSCAL

- on clon10 msql and rtserver must restart automatically on reboot

- nrpe did not restart on clon06, it was in maintenance state; as clasrun I did disable and enable - it works now; probably network was down when it was trying to restart ?

- make sure emergency generator is by-passed if in service

26-sep-2007: sent mail to Bob Rice
26-sep-2007: I decided to submit request to enforce better power routing during emergency generator panel repair: normal
power must be supplied during that time

Sergey Boyarinov's COMPLETED list

- replug AC power to emergency generators

17-sep-2007: done

- NTP servers on Solaris, update Solaris post-install page (and Linux on clon04)

24-sep-2007: done

- get TIBCO license:

17-sep-2007 requisition 266520 has been submitted
temporary password received, changed to permanent; access to tibco web site works
25-sep-2007: done

- PCAL test setup (with Sergey P.)

14-sep-2007: everything seems to be working, except missing delay cables and a weird
problem: FASTBUS hung on first event
26-sep-2007: replaced signal distr. card (TDC START did not work)
27-sep-2007: added 1 sec delay in 'Go' transition of fbrol1.c for standalone only, it seems
to have fixed the first-event-hang problem; everything seems Ok
27-sep-2007: done

- make paper 'DVCSCAL trigger system' and pass it to Chris and Ben

25-sep-2005 create new page DVCS Trigger System in clonweb wiki, document is there
25-sep-2007: done

- reply to Motorola about prpmc880

with Sergey P.: we asked Motorola for replacement, they offered PrPMC280-like replacement, Sergey on top of that, I'm done
for now

28-sep-2007: done

- on clonweb wrong apache is starting on reboot (check nagios and mrtg as well)

30-sep-2007: done

- on upcoming run request ran 3 ethernet cables and 1 serial cable to the target cartridge; new name 'ioctstarg' was added to the tsconnect.conf; all connections are tested

1-oct-2007: done


- ET system debugging (with Carl)

clonpc7:/etc> ifconfig -a
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
       inet 127.0.0.1 netmask 0xff000000 
       inet6 ::1 prefixlen 128 
       inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
       inet6 fe80::20d:93ff:fe65:ba7e%en0 prefixlen 64 scopeid 0x4 
       inet 129.57.68.7 netmask 0xffffff00 broadcast 129.57.68.255
       inet 192.168.2.1 netmask 0xffffff00 broadcast 192.168.2.255
       ether 00:0d:93:65:ba:7e 
       media: autoselect (100baseTX <full-duplex>) status: active
       supported media: none autoselect 10baseT/UTP <half-duplex> 10baseT/UTP <full-duplex> 10baseT/UTP 
   <full-duplex,hw-loopback> 100baseTX <half-duplex> 100baseTX <full-duplex> 100baseTX <full-duplex,hw-loopback>
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
       ether 00:11:24:a1:d0:47 
       media: autoselect (<unknown type>) status: inactive
       supported media: autoselect
fw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 2030
       lladdr 00:0d:93:ff:fe:65:ba:7e 
       media: autoselect <full-duplex> status: inactive
       supported media: autoselect <full-duplex>

ET system was confused:

clonpc7:et> et_start -n 100 -s 10000 -f /tmp/test3
et_netinfo reached
et_netinfo: fully qualified default hostname >clonpc7.jlab.org<
ifi->ifi_flags = 0xffff8049
LOOPBACK INTERFACE
INTERFACE IS UP
ifi->ifi_addr = 0x00300300
hptr = 0x00300190
addr_in->sin_addr = 127.0.0.1,
ifi->ifi_flags = 0xffff8863
INTERFACE IS UP
ifi->ifi_addr = 0x00300350
hptr = 0x00300190
addr_in->sin_addr = 129.57.68.7,
ifi->ifi_flags = 0xffff8863
INTERFACE IS UP
ifi->ifi_addr = 0x003003a0
hptr = 0x00000000
addr_in->sin_addr = 192.168.2.1,
et_netinfo: address = 129.57.68.7
et_netinfo: error in gethostbyaddr
we've got 192.168.2.1, do not believe it is true
et_netinfo: address = 129.57.68.7
removing file >/tmp/test3<
file >/tmp/test3< removed
et_udpreceive: port=11111
et_udpreceive: port=11112
et_udpreceive: port=11112
et_udpreceive: port=11112
ET user library >/usr/local/clas/devel/coda/Darwin_powerpc/lib/libet_user.so< will be used


to eliminate alias 192.168.2.1 following command was used:

ifconfig en0 -alias 192.168.2.1

Now it looks better:

clonpc7:/etc> et_start -n 100 -s 10000 -f /tmp/test3
et_netinfo reached
et_netinfo: fully qualified default hostname >clonpc7.jlab.org<
ifi->ifi_flags = 0xffff8049
LOOPBACK INTERFACE
INTERFACE IS UP
ifi->ifi_addr = 0x00300300
hptr = 0x00300190
addr_in->sin_addr = 127.0.0.1,
ifi->ifi_flags = 0xffff8863
INTERFACE IS UP
ifi->ifi_addr = 0x00300350
hptr = 0x00300190
addr_in->sin_addr = 129.57.68.7,
et_netinfo: address = 129.57.68.7
removing file >/tmp/test3<
file >/tmp/test3< removed
et_udpreceive: port=11111
et_udpreceive: port=11112
ET user library >/usr/local/clas/devel/coda/Darwin_powerpc/lib/libet_user.so< will be used

Oct-3-2007: got new version from Carl, test et_2_et: can connect through multiple ports to multiple ETs, BUT cannot connect between machines on different subnets. Tried DIRECT - does not work, ask Carl to check.

Oct-4-2007: pcal EB and ER are not working with new ET !!! Switch to old one, need check ...

Oct-9-2007: final check, everything looks good; add new option 'direct' to 'et_2_et', will be used for communication between machines which are not sharing any subnets, in that case only one ET system on remote machine can be connected as before; if machines are sharing at least one subnet, there is no limitations for the number of ET systems we can communicate with - good enough for CLON cluster

Oct-9-2007: done

- order extended maintenance kit for clonhp2

Oct-9-2007: done

tNetTask suspended problem debugging

      TROUBLESHOOTING SIMPLE NETWORK PROBLEMS IN TORNADO 2.0
      ------------------------------------------------------

If you are new to Tornado 2.0, or you are booting a target for the first time, or if the target booted successfully, but does not respond to pings, please read basicNetConfiguration article. basicNetConfiguration contains instructions on how to configure a kernel with a target shell and network debugging libraries. It contains information to avoid errors like:

  rpccore backend client RPC: Timed out
  Network interface unknown
  muxDevLoad failed for device entry
  wdbConfig: error configuring WDB communication interface
  Error loading file: errno =  0xd0003

If target responded to pings initially, and after running your socket application (or a library that uses the network such as the FTP server), applications sputter or hang for a while and then resume, please read netPerformance article. It gives an overview of vxWorks network internals and configuration issues that impact performance.

This article assumes that a target has booted successfully and responds to pings initially. You are either running demo code from WindSurf->Demo Code->Networking or your own application.

At some point, the target does not respond to pings anymore, or network performance is slow. At this point from the target shell, verify that tNetTask is still running at priority 50.

If i shows tNetTask as suspended, call tt tNetTask.

There might be a reference to a null pointer somewhere, or an uninitialized interrupt. If this error occurs with a default configuration, you have not added third party network drivers, or software, your application is not running, and you are just using WRS software on a supported BSP, please contact:

  support@windriver.com.

If application runs fine, but as you increase the number of sockets or the packet rate, the application sputters or hangs, run the show functions described below from the target shell AFTER the target boots and AFTER the problem occurs.

->netStackSysPoolShow
Network stack system memory pool status
->netStackDataPoolShow
Network stack data memory pool status

->i

 which tasks are running, show priorities

->inetstatShow

 shows sockets and their receive queues

->ipstatShow

 show IP protocol statistics 

->tcpstatShow

 show TCP protocol statistics 

->udpstatShow

 show UDP protocol statistics 

->icmpstatShow

 show ICMP protocol statistics 

->mRouteShow

 show route table, including masks 

->routeShow

 show route table without masks  

->iosFdShow

 show file descriptors

->arpShow

 show ARP table contents

->hostShow

 show host table content

->ifShow

show interfaces attached to IP.  This function will show
END and BSD44 drivers.

->muxShow

 show END drivers successfully loaded in the MUX

->endPoolShow("name", 0)

 END driver memory pool status of driver "name"
 For example: endPoolShow("fei", 0)

 endPoolShow is not part of the API.  This utility is
 found in utilities.c

SYMPTOMS:

A. ENOBUFS

  SOCKET APPLICATION SPUTTERS
  If the network stack or driver memory pool sizes are too small,
  applications will appear to hang and recover within minutes as
  resources are freed.  Errors like ENOBUFS, or
  S_netBufLib_NO_POOL_MEMORY may occur.
  The problem is most likely insufficient network stack and
  driver buffer pool.  Call netStackSysPoolShow and
  netStackDataPoolShow.  Increase the default allocation if
  there are 0 free buffers available. Increase the driver's
  buffer pool, and IP configuration parameters as indicated
  in netPerformance.

B. TARGET STOPS RESPONDING TO PINGS.

  RPC CORE BACKEND TIMEOUT REPORTED IN TARGET SERVER WINDOW

1. APPLICATION/DRIVER BUFFER LOANING PROBLEMS

  If sockets buffers are not being read, driver buffers will be
  depleted.  As long as the driver is out of buffers, the driver
  will not be able to receive or send packets.  Pings will be
  unanswered.  Target server will be unable to contact the
  target.
  muxShow displays END drivers only.  If the boot device (or the
  other parameter if boot device is floppy, ata...) is  listed
  in muxShow, the driver version used is END.
  ifShow displays both END and BSD 4.4 drivers.
  endPoolShow displays END driver's buffer pool status.  If
  there are 0 free buffers, use inetstatShow to determine which
  socket has data in the receive queue.  This information can
  be used to identify which task is not reading its sockets.
  See netPerformance for an explanation of network driver
  buffer loaning scheme.

2. DRIVER PROBLEMS

  If inetstatShow shows no data in the receive queue, the
  problem may be due to a faulty driver:
  If you are using the END driver version, try the BSD44 version
  (if available), and vice-versa.  If the BSP supports more than
  1 network card type, try a different card.  If muxShow does not
  show any drivers, BSD44 drivers are configured.  BSD44 drivers
  are configured if EITHER T101 build method is used, and
  INCLUDE_END is undefined in config.h OR in T2 Project
  facility:
  network components->network devices->End attach interface and
  End interface support components are excluded.
  See basicNetConfiguration document for information on
  configuring drivers.  If changing drivers works, report
  the problem to support@windriver.com.
  Check WindSurf->Resolve to check Problem Lists and Keyword
  Search.
  If there are no other driver versions you can use, use the
  loopback driver.  Run both the server and the client code in
  the target.  Send to IP address 127.0.0.1.

C. TARGET RESPONDS TO PINGS, BUT DATA TRANSFER STOPS

  If the target responds to pings, but data transfer stops, call
  inetstatShow on the target and the corresponding function
  (usually netstat) on the other host.  If there is data backed
  up in the send side and on the receive side on the other side,
  most likely there is a deadlock situation within the
  client/server application code.  Check if deadlock does not
  occur if you reduce the message size below 1460 bytes.
  (For TCP).  TCP is a stream of bytes.  Loops are required
  to read or write, since TCP can return partial reads and
  partial writes.
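
  The read/write loops mentioned above can be sketched as follows. This is an
  illustration only, not code from the article or from utilities.c: the
  helper names read_full and write_full are hypothetical, and the sketch
  assumes POSIX-style read() and write() on a connected stream socket.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read until len bytes have arrived, the peer closes, or an error occurs.
   Returns the number of bytes read (less than len only on end of stream),
   or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n == 0) break;                 /* end of stream */
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted, retry */
            return -1;
        }
        done += (size_t)n;                 /* partial read, keep going */
    }
    return (ssize_t)done;
}

/* Write all len bytes, retrying after partial writes.
   Returns len on success, or -1 on error. */
ssize_t write_full(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = write(fd, p + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted, retry */
            return -1;
        }
        done += (size_t)n;                 /* partial write, keep going */
    }
    return (ssize_t)done;
}
```

  A receiver that expects fixed-size messages would call read_full with the
  message size rather than assuming one read() returns one whole message.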

D. APPLICATION WORKS BUT NETWORK PERFORMANCE IS SLOW

  See netPerformance article.

E. PROTOCOL PROBLEMS

   Check the before and after protocol error statistics show
   routines for clues as to which error is increasing.  Please
   read Section 1.4.1 of NetPerformance article for an
   explanation on UDP full sockets.

F. PRIORITY PROBLEMS

Make sure that tNetTask has a higher priority than application tasks depending on the network. Please read Section 1.2 of netPerformance article.

G. HARDWARE OR DRIVER MISCONFIGURATION PROBLEMS

 1. Try changing the network card or using a different card.
    Examine hub's LED.  Verify that the LED shows activity for
    the driver. 
 2. Configure the BSD44 driver version, if available.  Call
    ifShow to determine if there are collisions, CRC ... The end
    driver does not report this information.  If the driver
    reports excessive collisions make sure that another node in
    the network is not in full duplex mode.
 3. If there are broadcast storms, verify that ifShow does not
    show driver is in PROMISCUOUS mode.



Attachment: utilities.c

/*********