The goal of the exercise is to get acquainted with a few of the statistical analysis practices used in HEP. Introductory slides will present a general refresher on statistics, as well as the main features of RooFit and RooStats, and will give some additional information on the hands-on session.
This exercise uses CMS public data from the 2010 proton-proton running. In particular, the invariant mass spectrum of muon pairs from data is provided. The purpose of the session is to approach the discovery of a "new" particle in the context of the observation of the ψ(2S) resonance in the CMS data. No specific knowledge of the CMS apparatus is needed. This is what the distribution of the dimuon invariant mass spectrum looks like for about 2% of the total 2010 statistics:
The J/ψ is of course clearly visible around the 3.1 GeV mass point. We expect to see an excess around 3.65 GeV from the ψ(2S). Is it there? Let's investigate!
Note that experiments do not usually proceed towards a discovery in this way. In general, blind analyses are performed, and the methodology and modeling used for the data interpretation are decided before looking at the actual data. For the purpose of describing the statistical tools in this lecture, we will set these considerations aside.
The exercise session introduces methods and tools for tackling very common statistical problems while complying with the rigorous standards of today's experimental particle physics. The main topics are finding a confidence interval for a parameter of interest and estimating the statistical significance of an excess of events with respect to the background expectation. We cover the most popular techniques: the profile likelihood and the Bayesian approach. Participants will learn the main features of the RooFit and RooStats software frameworks for statistical analysis and data modeling. Throughout, we work on the common case of hunting for a signal peak on top of a smoothly falling background.
All exercises should run on any ROOT installation that also contains the RooFit libraries (RooStats is included in the RooFit installation). On your laptop you can obtain this by passing the --enable-roofit option at the configure step when building ROOT. If you have a Debian-based installation, you just need to install the libRooFit library. All the code here uses PyROOT, the Python interface to the ROOT libraries, so a Python installation is also required.
Alternatively, if you have a CERN account, most of the central ROOT installations contain RooFit.
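A quick way to verify that your installation actually ships RooFit and RooStats is to try accessing them from PyROOT. This is just a minimal sanity check, not part of the exercises:

import ROOT

#If RooFit/RooStats are missing, these lookups raise an AttributeError
print ROOT.RooRealVar            #RooFit class
print ROOT.RooStats.ModelConfig  #RooStats class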
PyROOT programs can be run using the following command from the shell prompt:
python macro.py
In order to have all ROOT libraries available in PyROOT, the following line should be added at the beginning of every program (macro.py, in the previous command line example):
import ROOT
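If you run the macros on a remote machine, or simply don't want the canvases to pop up while the plots are anyway saved to file, you can optionally switch ROOT to batch mode right after the import. This is a small optional addition, not required by the exercises:

import ROOT
ROOT.gROOT.SetBatch(True)  #do not open graphics windows; canvases are still saved to file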
First of all, you will need a copy of the data ROOT file containing the invariant mass information from dimuon events. We will start with a data sample which corresponds to about 2% of the statistics collected in 2010:
wget http://pellicci.web.cern.ch/pellicci/DataSet_lowstat.root
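Before writing any fitting code, you can have a quick look at what the file contains. A minimal sketch (the RooDataSet stored in the file is called "data", as used in the exercise below):

import ROOT

fInput = ROOT.TFile("DataSet_lowstat.root")
fInput.ls()                   #list the objects stored in the file
data = fInput.Get("data")     #the RooDataSet with the dimuon invariant mass
print "Number of dimuon candidates:", data.numEntries()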
Let's create a skeleton for a PyROOT program. Let's call it, for instance, exercise_0.py, and load the ROOT libraries:
#Import the ROOT libraries
import ROOT
#Open the data file and retrieve the dataset
fInput = ROOT.TFile("DataSet_lowstat.root")
dataset = fInput.Get("data")
#The observable
mass = ROOT.RooRealVar("mass","#mu^{+}#mu^{-} invariant mass",2.,6.,"GeV")
#The Jpsi signal parametrization: we'll use a Crystal Ball
meanJpsi = ROOT.RooRealVar("meanJpsi","The mean of the Jpsi Crystal Ball",3.1,2.8,3.2)
sigmaJpsi = ROOT.RooRealVar("sigmaJpsi","The width of the Jpsi Crystal Ball",0.3,0.0001,1.)
alphaJpsi = ROOT.RooRealVar("alphaJpsi","The alpha of the Jpsi Crystal Ball",1.5,-5.,5.)
nJpsi = ROOT.RooRealVar("nJpsi","The n of the Jpsi Crystal Ball",1.5,0.5,5.)

CBJpsi = ROOT.RooCBShape("CBJpsi","The Jpsi Crystal Ball",mass,meanJpsi,sigmaJpsi,alphaJpsi,nJpsi)
#The psi(2S) signal parametrization: its width will be similar to the (almost Gaussian) core of the Jpsi Crystal Ball, so we reuse sigmaJpsi
meanpsi2S = ROOT.RooRealVar("meanpsi2S","The mean of the psi(2S) Gaussian",3.7,3.65,3.75)
gausspsi2S = ROOT.RooGaussian("gausspsi2S","The psi(2S) Gaussian",mass,meanpsi2S,sigmaJpsi)
#Background parametrization: just a polynomial
a1 = ROOT.RooRealVar("a1","The a1 of background",-0.7,-2.,2.)
a2 = ROOT.RooRealVar("a2","The a2 of background",0.3,-2.,2.)
a3 = ROOT.RooRealVar("a3","The a3 of background",-0.03,-2.,2.)
backgroundPDF = ROOT.RooChebychev("backgroundPDF","The background PDF",mass,ROOT.RooArgList(a1,a2,a3))
#Define the yields
NJpsi = ROOT.RooRealVar("NJpsi","The Jpsi signal events",1500.,0.1,10000.)
Nbkg = ROOT.RooRealVar("Nbkg","The bkg events",5000.,0.1,50000.)
#Now define the number of psi(2S) events as a product of cross section*efficiency*luminosity (in pb)
#Let's assume we measured the trigger, reconstruction and identification efficiency for dimuons and found it to be 95%
#Lowstat sample has 0.64 pb-1
#Fullstat sample has 37 pb-1
eff_psi = ROOT.RooRealVar("eff_psi","The psi efficiency",0.75,0.00001,1.)
lumi_psi = ROOT.RooRealVar("lumi_psi","The CMS luminosity",0.64,0.00001,50.,"pb-1")
cross_psi = ROOT.RooRealVar("cross_psi","The psi xsec",3.,0.,40.,"pb")

#Now define the number of psi events
Npsi = ROOT.RooFormulaVar("Npsi","@0*@1*@2",ROOT.RooArgList(eff_psi,lumi_psi,cross_psi))

#Important! We cannot fit simultaneously the efficiency, the luminosity and the cross section (our total PDF is only sensitive to the product of the three)
#We need to fix two of them, so we'll keep our POI floating
#One can also add an additional PDF to give predictive power on the other two parameters (later)
eff_psi.setConstant(1)
lumi_psi.setConstant(1)
#Compose the total PDF
totPDF = ROOT.RooAddPdf("totPDF","The total PDF",ROOT.RooArgList(CBJpsi,gausspsi2S,backgroundPDF),ROOT.RooArgList(NJpsi,Npsi,Nbkg))
#Do the actual fit
totPDF.fitTo(dataset, ROOT.RooFit.Extended(1))

#Print values of the parameters (that now reflect fitted values and errors)
print "##############"
meanpsi2S.Print()
NJpsi.Print()
Npsi.Print()
print "##############"
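If you also want the fit status and the parameter correlations, fitTo can return a RooFitResult when called with the Save() option. This is not needed for the exercise, but a minimal sketch looks like this:

#Re-run the fit, this time keeping the full fit result
fit_result = totPDF.fitTo(dataset, ROOT.RooFit.Extended(1), ROOT.RooFit.Save())
fit_result.Print("v")                     #verbose printout: parameters, errors, correlations
print "Fit status:", fit_result.status()  #0 means the fit converged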
#Now plot the data and the fit result
xframe = mass.frame()
dataset.plotOn(xframe)
totPDF.plotOn(xframe)

#One can also plot the single components of the total PDF, like the background component
totPDF.plotOn(xframe, ROOT.RooFit.Components("backgroundPDF"), ROOT.RooFit.LineStyle(ROOT.kDashed), ROOT.RooFit.LineColor(ROOT.kRed))

#Draw the results
c1 = ROOT.TCanvas()
xframe.Draw()
c1.SaveAs("exercise_0.png")
#Now save the data and the PDF into a Workspace, for later use in the statistical analysis
#'import' is a Python keyword, so we call the RooWorkspace::import method via getattr
ws = ROOT.RooWorkspace("ws")
getattr(ws,'import')(dataset)
getattr(ws,'import')(totPDF)

fOutput = ROOT.TFile("Workspace_mumufit.root","RECREATE")
ws.Write()
fOutput.Write()
fOutput.Close()
The entire exercise is available below:
import ROOT
#Open the rootfile and get the workspace from exercise_0
fInput = ROOT.TFile("Workspace_mumufit.root")
ws = fInput.Get("ws")
ws.Print()
#You can set constant parameters that are known
#If you leave them floating, the fit procedure will determine their uncertainty
ws.var("meanJpsi").setConstant(1)
#Set up the ModelConfig and let it know what the content of the workspace is about
model = ROOT.RooStats.ModelConfig()
model.SetWorkspace(ws)
model.SetPdf("totPDF")
#Here we explicitly set the value of the parameters for the null hypothesis
#We want no signal contribution, so cross_psi = 0
cross_psi = ws.var("cross_psi")
poi = ROOT.RooArgSet(cross_psi)
nullParams = poi.snapshot()
nullParams.setRealValue("cross_psi",0.)
#Build the profile likelihood calculator
plc = ROOT.RooStats.ProfileLikelihoodCalculator()
plc.SetData(ws.data("data"))
plc.SetModel(model)
plc.SetParameters(poi)
plc.SetNullParameters(nullParams)
#We get a HypoTestResult out of the calculator, and we can query it
htr = plc.GetHypoTest()
print "-------------------------------------------------"
print "The p-value for the null is ", htr.NullPValue()
print "Corresponding to a significance of ", htr.Significance()
print "-------------------------------------------------"

#PyROOT sometimes fails cleaning memory, this helps
del plc
Run this exercise. Is the peak around 3.65 GeV significant?
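To put the numbers printed by the HypoTestResult in context, RooStats also provides utility functions to convert between a one-sided p-value and the corresponding Gaussian significance. A small standalone sketch, independent of the exercise code:

import ROOT

#One-sided Gaussian convention: Z = 3 ("evidence") and Z = 5 ("observation")
for z in [3., 5.]:
    p = ROOT.RooStats.SignificanceToPValue(z)
    print "Z =", z, "corresponds to p-value =", p, "(and back:", ROOT.RooStats.PValueToSignificance(p), ")"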
The complete exercise is available below:
import ROOT

#Open the rootfile and get the workspace from exercise_0
fInput = ROOT.TFile("Workspace_mumufit.root")
ws = fInput.Get("ws")
ws.Print()

#You can set constant parameters that are known
#If you leave them floating, the fit procedure will determine their uncertainty
#Right now we will fix all the nuisance parameters just to speed up the computing time
ws.var("meanJpsi").setConstant(1)
ws.var("sigmaJpsi").setConstant(1)
ws.var("alphaJpsi").setConstant(1)
ws.var("nJpsi").setConstant(1)
ws.var("NJpsi").setConstant(1)
ws.var("meanpsi2S").setConstant(1)
ws.var("Nbkg").setConstant(1)
ws.var("a1").setConstant(1)
ws.var("a2").setConstant(1)
ws.var("a3").setConstant(1)
#Configure the model, we need both the S+B and the B only models
sbModel = ROOT.RooStats.ModelConfig()
sbModel.SetWorkspace(ws)
sbModel.SetPdf("totPDF")
sbModel.SetName("S+B Model")
poi = ROOT.RooArgSet(ws.var("cross_psi"))
poi.find("cross_psi").setRange(0.,40.)  #this is mostly for plotting
sbModel.SetParametersOfInterest(poi)

bModel = sbModel.Clone()
bModel.SetPdf("totPDF")
bModel.SetName( sbModel.GetName() + "_with_poi_0")
poi.find("cross_psi").setVal(0)
bModel.SetSnapshot(poi)
#First example is with a frequentist approach
fc = ROOT.RooStats.FrequentistCalculator(ws.data("data"), bModel, sbModel)
fc.SetToys(2500,1500)

#Create hypotest inverter passing the desired calculator
calc = ROOT.RooStats.HypoTestInverter(fc)
#set confidence level (e.g. 95% upper limits)
calc.SetConfidenceLevel(0.95)
#use CLs
calc.UseCLs(1)
#reduce the noise
calc.SetVerbose(0)

#Configure the ToyMC sampler
toymcs = calc.GetHypoTestCalculator().GetTestStatSampler()
#Use the profile likelihood as test statistic
profll = ROOT.RooStats.ProfileLikelihoodTestStat(sbModel.GetPdf())
#for CLs (bounded intervals) use the one-sided profile likelihood
profll.SetOneSided(1)
#set the test statistic to use for toys
toymcs.SetTestStatistic(profll)

npoints = 8 #Number of points to scan
#min and max for the scan (better to choose smaller intervals)
poimin = poi.find("cross_psi").getMin()
poimax = poi.find("cross_psi").getMax()
print "Doing a fixed scan in interval : ", poimin, " , ", poimax
calc.SetFixedScan(npoints,poimin,poimax)

#The scan returns a HypoTestInverterResult object
result = calc.GetInterval()
upperLimit = result.UpperLimit()
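The toy-based scan above can take a while. Purely as an illustration (not part of the exercise), the same HypoTestInverter can instead be driven by the asymptotic approximation, which avoids generating toys. The sketch below reuses bModel, sbModel, npoints, poimin and poimax defined above; the variable names ac, calc_as and result_as are introduced here only for this example:

#Asymptotic alternative to the toy-based FrequentistCalculator
ac = ROOT.RooStats.AsymptoticCalculator(ws.data("data"), bModel, sbModel)
ac.SetOneSided(1)  #one-sided profile likelihood test statistic, as for CLs above

calc_as = ROOT.RooStats.HypoTestInverter(ac)
calc_as.SetConfidenceLevel(0.95)
calc_as.UseCLs(1)
calc_as.SetFixedScan(npoints, poimin, poimax)

result_as = calc_as.GetInterval()
print "Asymptotic CLs upper limit:", result_as.UpperLimit()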
#Example using the BayesianCalculator
#Now we also need to specify a prior in the ModelConfig
#To be quicker, we'll use the PDF factory facility of RooWorkspace
#Careful! For simplicity, we are using a flat prior, but this doesn't mean it's the best choice!
ws.factory("Uniform::prior(cross_psi)")
sbModel.SetPriorPdf(ws.pdf("prior"))

#Construct the bayesian calculator
bc = ROOT.RooStats.BayesianCalculator(ws.data("data"), sbModel)
bc.SetConfidenceLevel(0.95)
bc.SetLeftSideTailFraction(0.) #for upper limit
bcInterval = bc.GetInterval()
#Now let's print the result of the two methods
#First the CLs
print "################"
print "The observed CLs upper limit is: ", upperLimit

#Compute expected limit
print "Expected upper limits, using the B (alternate) model : "
print " expected limit (median) ", result.GetExpectedUpperLimit(0)
print " expected limit (-1 sig) ", result.GetExpectedUpperLimit(-1)
print " expected limit (+1 sig) ", result.GetExpectedUpperLimit(1)
print "################"

#Now let's see what the bayesian limit is
print "Bayesian upper limit on cross_psi = ", bcInterval.UpperLimit()
#Plot now the result of the scan
#First the CLs
freq_plot = ROOT.RooStats.HypoTestInverterPlot("HTI_Result_Plot","Frequentist scan result for psi xsec",result)
#Then the Bayesian posterior
bc_plot = bc.GetPosteriorPlot()

#Plot in a new canvas with style
dataCanvas = ROOT.TCanvas("dataCanvas")
dataCanvas.Divide(2,1)
dataCanvas.SetLogy(0)
dataCanvas.cd(1)
freq_plot.Draw("2CL")
dataCanvas.cd(2)
bc_plot.Draw()
dataCanvas.SaveAs("exercise_2.png")
The complete exercise is available below:
For this last exercise we move to the full 2010 statistics (about 37 pb-1). Download the full dataset and re-run the fit of exercise_0.py on it (updating the input file name and lumi_psi accordingly) to refresh Workspace_mumufit.root:

wget http://pellicci.web.cern.ch/pellicci/DataSet.root
import ROOT

#Open the rootfile and get the workspace from exercise_0
fInput = ROOT.TFile("Workspace_mumufit.root")
ws = fInput.Get("ws")
ws.Print()

#You can set constant parameters that are known
#If you leave them floating, the fit procedure will determine their uncertainty
ws.var("meanJpsi").setConstant(1)
ws.var("sigmaJpsi").setConstant(1)
ws.var("alphaJpsi").setConstant(1)
ws.var("nJpsi").setConstant(1)
ws.var("NJpsi").setConstant(1)
ws.var("meanpsi2S").setConstant(1)
ws.var("Nbkg").setConstant(1)
ws.var("a1").setConstant(1)
ws.var("a2").setConstant(1)
ws.var("a3").setConstant(1)

#Let the model know what the parameter of interest is
cross_psi = ws.var("cross_psi")
cross_psi.setRange(4., 16.)  #this is mostly for plotting reasons
poi = ROOT.RooArgSet(cross_psi)
#Configure the model
model = ROOT.RooStats.ModelConfig()
model.SetWorkspace(ws)
model.SetPdf("totPDF")
model.SetParametersOfInterest(poi)

#Set confidence level
confidenceLevel = 0.68
#Build the profile likelihood calculator
plc = ROOT.RooStats.ProfileLikelihoodCalculator()
plc.SetData(ws.data("data"))
plc.SetModel(model)
plc.SetParameters(poi)
plc.SetConfidenceLevel(confidenceLevel)

#Get the interval
pl_Interval = plc.GetInterval()
#Now let's determine the Bayesian probability interval
#We could use the standard BayesianCalculator, but the numerical integration would be very slow
#So we take advantage of the Markov Chain MC capabilities of RooStats to speed things up
mcmc = ROOT.RooStats.MCMCCalculator(ws.data("data") , model)
mcmc.SetConfidenceLevel(confidenceLevel)
mcmc.SetNumIters(20000)           #Metropolis-Hastings algorithm iterations
mcmc.SetNumBurnInSteps(100)       #first N steps to be ignored as burn-in
mcmc.SetLeftSideTailFraction(0.5) #for a central interval
MCMC_interval = mcmc.GetInterval()
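As an aside, the same MCMCCalculator can be configured to return an upper limit rather than a central interval, by pushing all of the excluded probability to one side (this mirrors what SetLeftSideTailFraction(0.) did for the BayesianCalculator in exercise 2). A minimal sketch, not needed for the exercise; mcmc_ul and ul_interval are names introduced only here:

#One-sided (upper limit) variant of the Bayesian MCMC interval
mcmc_ul = ROOT.RooStats.MCMCCalculator(ws.data("data"), model)
mcmc_ul.SetConfidenceLevel(0.95)
mcmc_ul.SetNumIters(20000)
mcmc_ul.SetNumBurnInSteps(100)
mcmc_ul.SetLeftSideTailFraction(0.)  #0 -> upper limit, 0.5 -> central interval
ul_interval = mcmc_ul.GetInterval()
print "MCMC 95% CL upper limit on cross_psi:", ul_interval.UpperLimit(cross_psi)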
#Let's make a plot
dataCanvas = ROOT.TCanvas("dataCanvas")
dataCanvas.Divide(2,1)
dataCanvas.cd(1)
plot_Interval = ROOT.RooStats.LikelihoodIntervalPlot(pl_Interval)
plot_Interval.SetTitle("Profile Likelihood Ratio")
plot_Interval.SetMaximum(3.)
plot_Interval.Draw()

dataCanvas.cd(2)
plot_MCMC = ROOT.RooStats.MCMCIntervalPlot(MCMC_interval)
plot_MCMC.SetTitle("Bayesian probability interval (Markov Chain)")
plot_MCMC.Draw()

dataCanvas.SaveAs("exercise_3.png")
#Now print the interval on cross_psi for the two methods
print "PLC interval is [", pl_Interval.LowerLimit(cross_psi), ", ", pl_Interval.UpperLimit(cross_psi), "]"
print "Bayesian interval is [", MCMC_interval.LowerLimit(cross_psi), ", ", MCMC_interval.UpperLimit(cross_psi), "]"

#PyROOT sometimes fails cleaning memory, this helps
del plc
The complete exercise is available here:
| Attachment | Size | Date | Comment |
|---|---|---|---|
| CMSDAS_Stat_2019.pdf | 2886.8 K | 2019-01-28 | Slides to support the exercises |
| DataSet.root | 127.1 K | 2018-12-17 | Full 2010 statistics dataset for the dimuon mass distribution |
| DataSet_lowstat.root | 9.9 K | 2018-12-17 | Low statistics CMS dataset for the dimuon mass distribution |
| Data_lowstat.png | 13.0 K | 2018-12-17 | Low statistics data for the dimuon mass spectrum |
| exercise_0.png | 14.0 K | 2018-12-17 | Fit to the low statistics dimuon mass spectrum |
| exercise_0.py.txt | 3.5 K | 2019-01-28 | Exercise 0: fit the dimuon invariant mass spectrum |
| exercise_1.py.txt | 1.3 K | 2018-12-17 | Exercise 1: calculate the statistical significance of the excess |
| exercise_2.png | 17.0 K | 2018-12-17 | Comparison of different upper limit calculations |
| exercise_2.py.txt | 3.8 K | 2018-12-17 | Exercise 2: determine the upper limit on Nsig in the low statistics case |
| exercise_3.png | 12.6 K | 2018-12-17 | Intervals on the psi cross section |
| exercise_3.py.txt | 2.5 K | 2018-12-17 | Exercise 3: determine the interval on the psi cross section with full stat |