Handling outputs

Content:

  1. Introduction
  2. Addressed use-cases
  3. Client-side options
    1. Enable save-to-file technology
    2. Saving merged objects to an output file
    3. Creating a dataset
    4. Summary of option keywords
  4. Server-side options
  5. Summary of parameters and environment variables

1. Introduction

By default, the output of a PROOF query is kept in memory and made available via the output list. For large outputs this is known to be problematic. PROOF addresses the problem by using files to swap objects out of memory; the files can either be merged at the end or accessed through a common global view provided by a dataset. This technology does solve the problem; however, it turned out to be difficult for the average user to set up.

To simplify access to this technology in particular, and output handling more generally, a new set of options has been added to TProof::Process. These new options are the subject of these pages. They are available in the trunk (PROOF-Lite support starting from r45632) and in the 5.32 and 5.34 patch branches, starting from tags 5.34/02 and 5.32/05.

2. Addressed use-cases

One of the most frequent questions from PROOF users is how to save the results of a run to a file. This is not strictly connected to the technology used to handle the output, but rather to the fact that the Terminate() method is often used for little else than saving the results to a file. A quick way to define the file in which to save the results, without having to re-implement the same code in each TSelector, would therefore certainly be welcomed by many users. Another missing functionality is the possibility to save partial results while processing, so that in case of a crash not everything is lost and part of the results can be recovered.

The options described in this page allow the following cases to be controlled via the option field of TProof::Process:

  1. Define an output file where to save all the objects which are not already saved in other output files;
  2. Decide if the merging process to create the output file happens in memory or via file;
  3. Decide if partial results have to be saved after each packet;
  4. Give the cluster administrator the possibility to control file-saving by setting a memory threshold above which objects are swapped to file whatever the user's settings are;
  5. As an alternative to file merging, give the possibility to create a dataset from the files created on the nodes; the user can then decide what to do with the dataset.

 

3. Client-side options

3.1. Enable save-to-file technology

The keyword 'stf' or 'savetofile' can be used in the option field to force merging via file. By default the final file is kept in the user data directory on the master. For example, this is what the 'ProofSimple' tutorial looks like when this option is passed:


root [1] p->SetParameter("ProofSimple_NHist", (Long_t) 16) // NB: ProofSimple_NHist is needed by the ProofSimple tutorial, not by the save-to-file functionality
root [2] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "stf")
Mst-0: merging output objects ... done                                     
Output file: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/output-cernvm24-1343918756-15791.q1.root
Mst-0: grand total: sent 8 objects, size: 1405 bytes                            
ntuple opts: 0  0
(Long64_t)0

Internally, PROOF creates a TProofOutputFile object and adds it to the output list:

root [3] p->GetOutputList()->Print()
Collection name='TList', class='TList', size=1
Info in <:print>: -------------- output-cernvm24-1343918756-15791.q1.root : start (cernvm28.cern.ch:1093) ------------
Info in <:print>:  dir:              rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/
Info in <:print>:  raw dir:          /home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/
Info in <:print>:  file name:        output-cernvm24-1343918756-15791.q1.root
Info in <:print>:  run type:         create a merged file
Info in <:print>:  merging option:   keep remote
Info in <:print>:  output file name: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1343918756-15791/output-cernvm24-1343918756-15791.q1.root
Info in <:print>:  ordinal:          0
Info in <:print>: -------------- output-cernvm24-1343918756-15791.q1.root : done -------------

The method TProof::GetOutput() can be used to access the output objects transparently: if the requested object is not found in the output list and the output list contains TProofOutputFile objects, these files are opened and searched for the object. This is what ProofSimple::Terminate does to keep the behaviour of the tutorial unchanged.

 

3.2. Saving merged objects to an output file

It is possible to specify a file in which to save the output objects with the keywords 'of' or 'outfile'; for example:


root [1] p->SetParameter("ProofSimple_NHist", (Long_t) 16) // NB: ProofSimple_NHist is needed by the ProofSimple tutorial, not by the save-to-file functionality
root [2] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "of=test.root")
Mst-0: merging output objects ... done                                     
Mst-0: grand total: sent 23 objects, size: 15506 bytes                            
ntuple opts: 0  0
 Output saved to test.root
(Long64_t)0

The specified file path is interpreted from the client machine and can also be a full URL. If the path is prefixed with 'master:', it is interpreted from the master machine:

root [3] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "of=master:test.root")
Info in : unmodified script has already been compiled and loaded
Mst-0: merging output objects ... done                                     
Output file: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1344002021-23918/test.root
Mst-0: grand total: sent 8 objects, size: 1005 bytes                            
ntuple opts: 0  0
(Long64_t)0

By default files are created in the user data directory on the master; a TProofOutputFile is always sent back to the user with the location of the file and the URL to open it remotely. If the option 'stf' is specified (or if the server-side settings enforce it), merging goes via file and the location of the intermediate file is notified via TProofOutputFile:

root [5] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "of=test.root;stf")
Info in : unmodified script has already been compiled and loaded
Mst-0: merging output objects ... done                                     
Output file: rootd://proofadm@cernvm24.cern.ch:1093//home/proofadm/PEAC/proof/proofbox/ganis/data/0/cernvm24-1344002021-23918/test.root
Mst-0: grand total: sent 8 objects, size: 1281 bytes                            
ntuple opts: 0  0
[TFile::Cp] Total 0.02 MB       |====================| 100.00 % [1.9 MB/s]
 Output successfully copied to test.root
(Long64_t)0

 

3.3. Creating a dataset

To create a dataset use the keyword 'ds':


root [6] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "ds=testds")
Info in : unmodified script has already been compiled and loaded
Registering dataset 'testds' ... OK (1 workers still sending)    
Mst-0: merging output objects ... done                                     
Mst-0: grand total: sent 8 objects, size: 26539 bytes                            
ntuple opts: 0  0
(Long64_t)0

The 'savetofile' option is enforced in this case. By default the dataset is only registered; to force verification add '|V':

root [7] p->Process("tutorials/proof/ProofSimple.C+", 40000000, "ds|V")
Info in : unmodified script has already been compiled and loaded
Registering dataset 'dataset_cernvm24-1344002021-23918_q6' ... OK
Mst-0: merging output objects ... done                                     
Mst-0: grand total: sent 8 objects, size: 26569 bytes                            
ntuple opts: 0  0
Collection name='TList', class='TList', size=12
 Collection name='FeedbackList', class='TList', size=0
  TParameter      ProofSimple_NHist = 16
 OBJ: TNamed    PROOF_DefaultOutputOption       ds:dataset_
  TParameter       PROOF_SavePartialResults = 1
 OBJ: TNamed    PROOF_QueryTag  session-cernvm24-1344002021-23918:q6
 OBJ: TNamed    PROOF_FilesToProcess    dataset:dataset_cernvm24-1344002021-23918_q6
 OBJ: TNamed    PROOF_Packetizer        TPacketizerFile
 OBJ: TNamed    PROOF_VerifyDataSet     dataset_cernvm24-1344002021-23918_q6
 OBJ: TNamed    PROOF_VerifyDataSetOption
  TParameter       PROOF_IncludeFileInfoInPacket = 1
 OBJ: TNamed    PROOF_MSS
 OBJ: TNamed    PROOF_StageOption
Registering dataset 'dataset_cernvm24-1344002021-23918_q6' ... OK
Mst-0: merging output objects ... done                                     
Mst-0: grand total: sent 102 objects, size: 29567 bytes                            
Info in <:verifydataset>: dataset_cernvm24-1344002021-23918_q6: changed? 1 (# files opened = 24, # files touched = 0, # missing files = 0)
(Long64_t)0

This example shows that it is not necessary to specify a name for the dataset: if missing, the name is built from 'dataset_', the session identifier, '_q' and the query number, as in 'dataset_cernvm24-1344002021-23918_q6' above.
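
The handling of the '|V' marker can be sketched in standalone C++ as follows. This is a simplified model rather than the code used by PROOF: DatasetRequest and ParseDatasetField are hypothetical names, and the function receives only the text that follows the 'ds' keyword (and the '=' sign, if present).

```cpp
#include <string>

// Hypothetical result of interpreting the field given to 'ds' / 'dataset'.
struct DatasetRequest {
   std::string name;   // dataset name with '|V' removed; empty means "use the default name"
   bool verify;        // true if the '|V' marker was present
};

// Split the field into a dataset name and a verification flag (illustrative only).
DatasetRequest ParseDatasetField(std::string field)
{
   DatasetRequest req{"", false};
   const std::string flag = "|V";
   const std::string::size_type pos = field.find(flag);
   if (pos != std::string::npos) {
      req.verify = true;
      field.erase(pos, flag.size());   // the marker is not part of the final name
   }
   req.name = field;
   return req;
}
```

With this model, the field "testds" gives a registered but unverified dataset named testds, while the field "|V" (as in the "ds|V" example) gives an empty name, so the default name would be used, with verification requested.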

 

3.4. Summary of option keywords

The client-side keywords described in this section are summarized in Table 1.

Table 1. Client-side keywords
Long name          Short        Description                                                   Subsection
savetofile[=opt]   stf[=opt]    Controls saving of partial results to file; the optional      3.1
                                opt field has the form o1*10 + o0, with
                                    o0 = 0  save if required by the administrator
                                    o0 = 1  force saving
                                    o1 = 1  save after each packet
                                    o1 = 0  save at query end
                                Default is opt = 1 when the keyword is specified (0 if
                                it is not specified).
outfile=fileout    of=fileout   Enables saving the query output to the file fileout. The      3.2
                                path (which can be a full URL) is interpreted from the
                                client session unless it starts with 'master', in which
                                case the file is created from the master session.
                                Using 'of=master' saves the results in the master data
                                directory with name
                                    .q.root
                                If not specified, this option is internally set to
                                'master' when the administrator forces saving to file.
dataset[=name]     ds[=name]    Enables creation of a dataset out of the saved files.         3.3
                                The list of files is also returned in the output list in
                                the form of a TFileCollection. The dataset is registered
                                under name if registration is allowed by the
                                administrator. The dataset is also verified if the name
                                field contains '|V'; in that case the sequence '|V' is
                                removed from the final dataset name. If not specified,
                                the name is built from 'dataset_', the session
                                identifier, '_q' and the query number.
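
The packing of the two flags into the opt field, opt = o1*10 + o0, can be made concrete with a small standalone C++ sketch. The struct and function names are hypothetical and serve only to illustrate the arithmetic described in Table 1; they are not part of the PROOF API.

```cpp
// Hypothetical view of the two digits packed in the 'stf' option value.
struct SaveToFileOpt {
   bool force;             // o0: 1 = force saving, 0 = save only if required by the administrator
   bool afterEachPacket;   // o1: 1 = save after each packet, 0 = save at query end
};

// Pack the two flags as opt = o1*10 + o0.
int EncodeSaveToFile(const SaveToFileOpt &o)
{
   return (o.afterEachPacket ? 10 : 0) + (o.force ? 1 : 0);
}

// Recover the two flags from opt: o0 is the units digit, o1 the tens digit.
SaveToFileOpt DecodeSaveToFile(int opt)
{
   return {opt % 10 == 1, (opt / 10) % 10 == 1};
}
```

Under this reading, the default value 1 means "force saving, at query end", and 11 means "force saving after each packet".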

 

4. Server-side options

 

 

5. Summary of parameters and environment variables

The parameters and RC environment variables affecting the options defined in this section are summarized in Table 2.

Table 2. Parameters and RC variables
Name                             Type     Description                                               Remarks
PROOF_DefaultOutputOption        Param    String parameter used internally to pass the output
                                          file or the dataset name; in the first case it has
                                          the form 'of:fileout', while for datasets it has the
                                          form 'ds:name'.
PROOF_SavePartialResults         Param    Int_t parameter containing the 'savetofile' option.
ProofPlayer.SavePartialResults   RC var   As PROOF_SavePartialResults.                              Server side only
ProofPlayer.SaveMemThreshold     RC var   Float_t parameter defining the threshold above which      Server side only
                                          file saving is forced; it is expressed as a fraction
                                          of the physical memory per core.
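
To give a feeling for the memory threshold, the standalone C++ sketch below works through the arithmetic under one possible reading of "fraction of physical memory per core", namely fraction * (physical memory / number of cores). This reading, the function name SaveThresholdBytes and the guard for invalid inputs are assumptions made for illustration; the page does not spell out how PROOF evaluates the threshold internally.

```cpp
// Illustrative only: one possible reading of "fraction of physical memory per core".
// Returns the amount of memory (in bytes) above which file saving would be forced.
double SaveThresholdBytes(double fraction, double physicalMemBytes, int nCores)
{
   if (nCores <= 0 || fraction < 0.0)
      return 0.0;   // hypothetical guard for invalid inputs, not PROOF behaviour
   return fraction * (physicalMemBytes / nCores);
}
```

For example, on a node with 16 GiB of physical memory and 8 cores, a fraction of 0.5 would correspond to a threshold of 1 GiB under this assumption.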