Move2silo

The data moving process is initiated by a cronjob running from the 'clascron' account, for example for the HPS experiment on clondaq5:

0,10,20,30,40,50 * * * * /usr/local/scicomp/jasmine/bin/jmigrate /data/stage5 /data/stage5 /mss/hallb/hps/physrun2019/data -jvm:-Dfile.transfer.client.displayrates=true

and for CLAS12 run group A on clondaq6:

0,10,20,30,40,50 * * * * /usr/local/scicomp/jasmine/bin/jmigrate /data/stage6 /data/stage6 /mss/clas12/rg-a/data -jvm:-Dfile.transfer.client.displayrates=true

Cronjob files for different run periods can be found in the ~clascron/backup/ directory.
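
A quick way to check which schedule is currently active, or to restore one of the saved schedules, is sketched below (the backup file name is only an example; use whichever run period file actually exists in ~clascron/backup/):

# as user 'clascron': list the currently installed crontab
crontab -l
# install a saved schedule from the backup directory (file name is hypothetical)
crontab ~clascron/backup/crontab.rg-a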


OLD

In auto.master, make sure auto.direct is mounted without a timeout:

/-      /etc/auto.direct  --timeout 0

If this was changed, restart autofs (on RHEL7 run 'service autofs restart'). To forcibly unmount, run 'umount -lf /xxx/yyy'.
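
A minimal check-and-restart sequence, assuming the RHEL7 service name mentioned above (the umount path is taken from the auto.direct entry below):

# confirm the direct map is listed with no timeout
grep auto.direct /etc/auto.master
# restart the automounter so the change takes effect
service autofs restart
# if the mount point hangs, detach it forcibly
umount -lf /lustre/scicomp/jasmine/fairy2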

The following has to be in auto.direct:

/lustre/scicomp/jasmine/fairy2 -fstype=nfs,rw,async,vers=3 scidaqgw10b:/lustre/scicomp/jasmine/fairy2
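
To verify that the direct mount is actually served over NFS and reachable, something like this can be used (standard commands only, no jasmine tools involved):

# is the fairy2 area mounted via NFS?
mount | grep fairy2
# can its contents be listed?
ls /lustre/scicomp/jasmine/fairy2/ | head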

The following cronjob has to be running as user 'clascron' on the machine moving data to tape:

# Scan for to-tape files every 15 minutes
10,25,40,55 * * * * /usr/local/scicomp/jasmine/bin/jmigrate /data/totape /data/totape /mss/clas12/er-a/data -jvm:-Dfile.transfer.client.displayrates=true
# access occasionally to keep it visible
* * * * * /bin/csh -c "(ls -al /lustre/scicomp/jasmine/fairy2/) >>&! /usr/logs/disks/clondaq6_lustre"
* * * * * /bin/csh -c "(sleep 48; rm -f /usr/logs/disks/clondaq6_lustre) >>&! /dev/null"
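
To confirm that the keep-alive entries above are firing, the log file they write can be inspected; note that the second entry removes it again after 48 seconds, so it may be absent at any given moment (the path comes from the clondaq6 crontab above):

# present for roughly the first 48 seconds of every minute
cat /usr/logs/disks/clondaq6_lustre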

Log files are in /usr/local/scicomp/jasmine/log/jmigrate/data-totape/.
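
To follow the most recent transfer, list the newest logs and tail the latest one (the log file name below is a placeholder):

# newest jmigrate logs appear at the bottom of the listing
ls -ltr /usr/local/scicomp/jasmine/log/jmigrate/data-totape/ | tail
# follow a specific log as the transfer proceeds
tail -f /usr/local/scicomp/jasmine/log/jmigrate/data-totape/<latest-log-file>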

If the job is stuck, remove the lock file:

rm /tmp/jmigrate-data-totape.lock

A useful command to check the process status:

ps auxf | grep java

If you see something like

clascron 128847  0.0  1.0 7307240 662088 ?      D    Dec14   0:00  \_ java -DJMirror.minFileModif.........

then the job marked 'D' is in an uninterruptible state and cannot be killed by 'kill -9'. Other stuck jobs can be killed; they become <defunct>.
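
For stuck jobs that are not in the 'D' state, a cleanup might look like the sketch below (the PID is the example value from the ps output above):

# kill a stuck (non-'D') jmigrate java process
kill -9 128847
# killed jobs linger as <defunct> zombies until reaped by their parent
ps auxf | grep defunct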

To see files on tape, ssh to ifarm65 and type:

ls -ltrh /mss/clas12/er-a/data/ | tail

To see files still in cache:

ls -ltrh /cache/mss/clas12/er-a/data/ | tail
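
Combining the two checks, the status of a single raw data file can be inspected like this (the file name keeps the placeholder used elsewhere on this page):

# listed under /mss once it has reached the tape library
ls -l /mss/clas12/er-a/data/clas_00XXXX.evio.Y
# still present on the cache disk, if not yet evicted
ls -l /cache/mss/clas12/er-a/data/clas_00XXXX.evio.Y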

To retrieve from tape:

jget /mss/clas12/er-a/data/clas_00XXXX.evio.Y #copy to the current directory
jget /mss/clas12/er-a/data/clas_00XXXX.evio.Y /path/to/dir #copy to the directory with path
jget /mss/clas12/er-a/data/clas_00XXXX.evio.Y /mss/clas12/er-a/data/clas_00XXXX.evio.Z /path/to/dir # copy two files

The mover status can be checked at https://scicomp.jlab.org/scicomp/moverStatus

The job schedule can be checked at https://scicomp.jlab.org/scicomp/tapeJob/scheduled

NOTE: problems were observed in Dec 2017 when moving data with NFS-mounted scidaqgw10b or scidaqgw10f. To fix a stuck job, jcancel had to be issued on the mover side. When the NFS mount was removed and the process started to use a network socket, everything worked.