IGWN dHTC F2F

Name: IGWN dHTC F2F
Start: 2026-01-27T09:00:00-05:00
End: 2026-01-29T17:00:00-05:00
Location: Georgia Tech College of Computing Building

27 Jan 2026, 09:00 → 29 Jan 2026, 17:00 US/Eastern

103 (Georgia Tech College of Computing Building)

Peter Couvares (LIGO Lab / Caltech)

Description

Zoom link: https://caltech.zoom.us/j/83057465840

Tuesday 27 January
- 09:00 → 12:00
  
  HTCondor-specific discussion / questions / issues 3h
  
  Copy-pasted (for this block only) from the Google doc to show what Indico minutes look like; this is not the way you should do it!
  
  Agenda
  
  Usability / UX Low-Hanging Fruit
  
  James’s & Cort’s (others too?) opinionated short-list of the most valuable UI/UX improvements needed for DHTC power users (or heavy users)
  data transfer state appearing in condor_q -dag (JIRA ticket)
  Apparently this is shown by condor_watch_q, but not for DAGs
  Cort: “Why don’t I use condor_watch_q? It’s slow to start, and fills my screen with jobs I don’t care about.”
  HTCONDOR-3517
  James: it sucks that there is still no job status code to distinguish running vs. xferring input (but there is for running and xferring output)
  Note there are job attributes TransferringInput and TransferringOutput, plus TransferQueued if the transfer is waiting in the transfer queue
  Can htcondor credential list show the Vault token’s age or expiry time? It currently shows when the bearer token was last copied to jobs.
  Do we want to talk about UX for storing and managing Vault credentials?
  Cort is checkpointing between creation of output files. Is this reasonable enough to justify first-class support? A possible interface might be for transfer_checkpoint_files to also accept a list of lists, or to ignore files that don’t yet exist in the list it is given.
  Admin command to raise a user's priority or floor value for N core hours or days? Would make temporary increases easier.
  HTCONDOR-3518
  #170 — users want a condor_rm that preserves+returns interrupted jobs' latest output before exiting
  HTCONDOR-3519
  #183 — HTCondor vault token management on ldas-osg2 seems to be sending wrong file as bearer token
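
Until a dedicated status code exists, the gap James describes (running vs. transferring input) can be approximated on the client side by combining JobStatus with the transfer attributes noted above. This is a minimal sketch, assuming job ads are available as plain dictionaries and that the relevant attributes are JobStatus, TransferringInput, TransferringOutput, and TransferQueued. The function name `job_phase` and the phase labels are hypothetical, not part of HTCondor.

```python
# JobStatus integer codes as documented for HTCondor job ads.
JOB_STATUS = {1: "idle", 2: "running", 3: "removed", 4: "completed",
              5: "held", 6: "transferring output", 7: "suspended"}

def job_phase(ad):
    """Return a readable phase for one job ad (a plain dict in this sketch)."""
    status = JOB_STATUS.get(ad.get("JobStatus"), "unknown")
    # Only jobs that have started can be in a transfer sub-state.
    if status not in ("running", "transferring output"):
        return status
    # Assumption: TransferQueued is true while the transfer waits in the queue.
    queued = bool(ad.get("TransferQueued", False))
    if ad.get("TransferringInput", False):
        return "input transfer queued" if queued else "transferring input"
    if ad.get("TransferringOutput", False):
        return "output transfer queued" if queued else "transferring output"
    return status

print(job_phase({"JobStatus": 2, "TransferringInput": True}))
```

The same mapping could be applied to ads pulled from the schedd or from Elasticsearch records, as long as the transfer attributes are present in them.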
  
  Robustness
  
  What are the obstacles to having glideins use cgroup memory mgmt? For those sites where we control the AP and EP, can we do better?
  #195 — HA collectors & glideinWMS: lousy UX when secondary collector offline, questions whether HA collectors are worth the complexity
  #194 — Condor didn't transfer files when job was held (checkpointing/eviction behavior)
  Where did our idea of having first-class vs. wild-west resources, and auto-routing some users to each (with opt-in to others), stall?
  
  Capacity / Efficiency / Optimization
  
  Per user local caching of OSDF file transfers on EPs? [Todd says this is targeted for April]
  
  Notes
  
  James’s & Cort’s (others too?) opinionated short-list of the most valuable UI/UX improvements needed for DHTC power users (or heavy users)
  
  Cort: bias towards giving power users live & historical infrastructure & workflow metadata that they can interrogate/process/visualize, rather than packaging it up for me in a tool summary or batch report – because the infrastructure & usage is evolving/changing fast enough that the problem areas and bottlenecks will change over time, and any tool that chooses what to show me in advance will probably be blind to what I need to see tomorrow
  
  Top improvements for power users
  
  live & historical infrastructure & workflow metadata that they can interrogate/process/visualize
  Specifically?
  Provide a tool that takes the transfer data for a DAG over a time period and prints a summary of the activity (e.g., “can you give me success rates broken down by cache?”).
  Prioritize tools that power users run to ask a question, rather than tools that deliver “here’s what happened yesterday”.
  Interested in the stacked concurrency of jobs broken down by state (e.g., 8 hours ago, how many jobs for this DAG were in pending / running / transferring). X-axis is time.
  Distinguish between “transfer input phase” in general and sub-states (transfer queued, CEDAR running, “OSDF transfer running”).
  ACTION: (Cole) Improve DAGMan to track transfer information
  ACTION: (Cole) Update htcondor dag status to display transfer status information for nodes/jobs
  ACTION: (Cole) Add python API query to get current status directly from DAGMan
  ACTION: (Cole) Add -watch or -follow option to htcondor dag status
  ACTION: (Cole) Add per workflow epoch history file and create DAG time analysis of state (this could be done with the nodes.log?)
  ACTION: (Cole) Tool to convert while/until code statement into recursive DAG structure
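
The stacked-concurrency view requested above comes down to counting, at each sample time, how many jobs of a DAG are in each state. This is a rough sketch, assuming the state transitions have already been extracted into `(timestamp, job_id, state)` tuples. That input format, the state names, and the function `concurrency_by_state` are illustrative assumptions, not an existing HTCondor or DAGMan interface.

```python
from collections import Counter

def concurrency_by_state(transitions, sample_times):
    """Count jobs per state at each sample time.

    transitions: iterable of (timestamp, job_id, state). A job stays in a
    state until its next transition; a state of None means it left the queue.
    Returns a list of (sample_time, {state: job_count}) rows.
    """
    events = sorted(transitions, key=lambda e: e[0])
    current = {}  # job_id -> state as of the events applied so far
    rows = []
    i = 0
    for t in sorted(sample_times):
        # Apply every transition that happened at or before this sample time.
        while i < len(events) and events[i][0] <= t:
            _, job_id, state = events[i]
            if state is None:
                current.pop(job_id, None)
            else:
                current[job_id] = state
            i += 1
        rows.append((t, dict(Counter(current.values()))))
    return rows

# Hypothetical transitions for two jobs, A and B, with times in minutes.
transitions = [
    (0, "A", "idle"), (0, "B", "idle"),
    (10, "A", "transferring input"),
    (20, "A", "running"), (25, "B", "transferring input"),
    (40, "A", None),
]
for t, counts in concurrency_by_state(transitions, [5, 15, 30, 45]):
    print(t, counts)
```

Each row maps directly onto one column of a stacked plot with time on the x-axis, and the sub-states of the input transfer phase can be added simply as extra state names.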
  
  Top improvements for non-power users
  
  Josh had some ideas on showing whether workflows are “stuck” or not
  Mostly, what should a user be able to monitor that shows the workflow is making forward progress: are enough jobs running (consistent with their priority, any floor/ceiling, and how rare the resources that match their jobs are), and how much time are those jobs spending idle/transferring input/running/transferring output? Presumably this is tracked per “workflow”, where a workflow can be a single DAG, the top-level DAG of a nested DAG, or a single submit file if there is no DAG.
  ACTION: (Todd T) condor_q checks parent PID for watch command to warn/recommend condor_watch_q
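
One way the parent-process check in the action above could work on Linux is to read the parent's command name from `/proc`. This is a minimal sketch only: the helper names, the hint text, and the use of `/proc/<pid>/comm` are assumptions for illustration, and condor_q itself is not written in Python and may do this differently.

```python
import os
from pathlib import Path

def parent_command(ppid=None, proc_root="/proc"):
    """Return the parent's command name, or None if it cannot be read."""
    ppid = os.getppid() if ppid is None else ppid
    try:
        # On Linux, /proc/<pid>/comm holds the command name of the process.
        return (Path(proc_root) / str(ppid) / "comm").read_text().strip()
    except OSError:
        return None

def watch_hint(command):
    """Return a hint string if the tool appears to run under watch(1)."""
    if command == "watch":
        return "Hint: consider condor_watch_q instead of running condor_q under watch."
    return None

hint = watch_hint(parent_command())
if hint:
    print(hint)
```

Because `watch` normally runs its command through `sh -c`, the immediate parent may be a shell rather than `watch` itself, so a real check would probably need to walk one or more levels further up the process tree.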
  
  James’s AX Improvements
  
  ACTION: during this F2F, James to check whether Greg’s xfer in/out boolean is good enough for him to know job status = “xfer input files” in Elasticsearch (and if not, the action is for Greg to add something that will tell him)
  ACTION: Brian to agree to chaos monkey for OS pool CM -or- IGWN will migrate away from Condor CM HA
- 12:00 → 13:30
  
  Lunch 1h 30m
- 13:30 → 15:00
  
  Issues at the boundary of Pelican & Condor 1h 30m
- 15:00 → 15:20
  
  User talks: BayesWave 20m
  
  Speaker: Shobhit Ranjan
  
  Slides
- 15:20 → 15:40
  
  User talks: gstlal / sgnl 20m
  
  Speaker: Urja Shah
  
  Slides
- 15:40 → 16:00
  
  User talks: DetChar on site resources 20m
  
  Speaker: Evan Goetz
  
  Slides in DCC
- 16:00 → 16:20
  
  User talks: RIFT 20m
  
  Speaker: Richard O'Shaughnessy
- 16:20 → 17:00
  
  Discussion 40m
Wednesday 28 January
- 08:00 → 08:20
  
  User talks: PyCBC 20m
- 08:20 → 08:40
  
  User talks: CW 20m
  
  Speaker: Stefano Dal Pra
  
  Slides
- 08:40 → 12:00
  
  Discussion 3h 20m
- 12:00 → 13:30
  
  Lunch 1h 30m
- 13:30 → 14:00
  
  Ways in which IGWN is “weird” 30m
- 14:00 → 15:00
  
  OSDF Cache Selection Issues 1h
- 15:00 → 16:00
  
  OSDF / IGWN Stress-Test / Scalability Campaign 1h
- 16:00 → 17:00
  
  How can/should DAGMan better handle workflow requirements (as opposed to job)? 1h
Thursday 29 January
- 09:00 → 12:00
  
  Increasing difficulty matching CPU/highmem and CPU/GPU jobs to resources without starvation, badput, or wasted resources. 3h
- 12:00 → 13:30
  
  Lunch 1h 30m
- 13:30 → 17:00
  
  Spillover / Misc Topics / Hackathons 3h 30m
