Just copy-pasted (for this block only) from a Google Doc to show what INDICO minutes look like - not the way you do it!
Agenda
Usability / UX Low-Hanging Fruit
James’s & Cort’s (others too?) opinionated short-list of the most valuable UI/UX improvements needed for DHTC power users (or heavy users)
data transfer state appearing in condor_q -dag (JIRA ticket)
Apparently this is shown by condor_watch_q, but not for DAGs
Cort: “Why don’t I use condor_watch_q? It’s slow to start, and fills my screen with jobs I don’t care about.”
HTCONDOR-3517
James: it sucks that there is still no job status code to distinguish running vs. xferring input (there is one for running vs. xferring output)
Note there are job attributes TransferringInput, TransferringOutput, and TransferQueued (true when the transfer is waiting in the transfer queue)
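Until a dedicated status code exists, a power-user script can approximate the sub-state from those attributes. A minimal sketch, assuming the job ad has been rendered as a dict (e.g., from a Schedd query projection); `classify_transfer_state` is an illustrative helper, not part of HTCondor:

```python
def classify_transfer_state(ad):
    """Map HTCondor job-ad transfer attributes to a readable sub-state.

    `ad` is a job ClassAd rendered as a dict. TransferQueued is checked
    first because a queued transfer may also have TransferringInput set.
    """
    if ad.get("TransferQueued"):
        return "transfer queued"
    if ad.get("TransferringInput"):
        return "transferring input"
    if ad.get("TransferringOutput"):
        return "transferring output"
    # JobStatus == 2 is "Running" in HTCondor's numeric status codes
    return "running" if ad.get("JobStatus") == 2 else "other"
```

The ordering encodes the precedence a status code would need: queued before in-flight, transfer before plain running.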
Can htcondor credential list show the Vault token’s age or expiry time? It currently shows when the bearer token was last copied to jobs.
Do we want to talk about UX for storing and managing Vault credentials?
Cort is checkpointing between creation of output files. Is this reasonable enough to justify first-class support? A possible interface might be for transfer_checkpoint_files to also accept a list of lists, or to ignore files that don’t yet exist in the list it is given.
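Until there is first-class support, a job-side wrapper can emulate the "ignore files that don’t yet exist" behavior by pruning the list before a checkpoint exit. A sketch; `existing_checkpoint_files` is a hypothetical helper, not an HTCondor feature:

```python
import os

def existing_checkpoint_files(candidates):
    """Return only the checkpoint files that have been created so far.

    `candidates` mirrors what one would list in transfer_checkpoint_files;
    output files the job has not yet produced are silently skipped.
    """
    return [path for path in candidates if os.path.exists(path)]
```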
Admin command to raise a user's priority or floor value for N core hours or days? Would make temporary increases easier.
HTCONDOR-3518
#170 — users want a condor_rm that preserves+returns interrupted jobs' latest output before exiting
HTCONDOR-3519
#183 — HTCondor vault token management on ldas-osg2 seems to be sending wrong file as bearer token
Robustness
What are the obstacles to having glideins use cgroup memory mgmt? For those sites where we control the AP and EP, can we do better?
#195 — HA collectors & glideinWMS: lousy UX when secondary collector offline, questions whether HA collectors are worth the complexity
#194 — Condor didn't transfer files when job was held (checkpointing/eviction behavior)
Where did our idea of having first-class vs. wild-west resources, auto-routing some users to each (with opt-in to the other), stall?
Capacity / Efficiency / Optimization
Per user local caching of OSDF file transfers on EPs? [Todd says this is targeted for April]
Notes
James’s & Cort’s (others too?) opinionated short-list of the most valuable UI/UX improvements needed for DHTC power users (or heavy users)
Cort: bias towards giving power users live & historical infrastructure & workflow metadata that they can interrogate/process/visualize, rather than packaging it up for them in a tool summary or batch report. The infrastructure & usage are evolving fast enough that the problem areas and bottlenecks will change over time, and any tool that chooses in advance what to show will probably be blind to what users need to see tomorrow.
Top improvements for power users
live & historical infrastructure & workflow metadata that they can interrogate/process/visualize
Specifically?
Provide a tool that takes the transfer data for a time period and prints out a summary of the activities (e.g. - “can you give me success rates broken down by cache?”), for a DAG.
Prioritize tools that power users run to ask a question, not tools that deliver “here’s what happened yesterday”.
Interested in the stacked concurrency of jobs broken down by state (e.g., 8 hours ago, how many jobs for this DAG were in pending / running / transferring). X-axis is time.
Distinguish between “transfer input phase” in general and sub-states (transfer queued, CEDAR running, “OSDF transfer running”).
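As a concrete shape for the “success rates broken down by cache” question, a sketch that aggregates per-transfer records; the record format (a `cache` name and a boolean `ok` flag) is an assumption, not an existing HTCondor or OSDF schema:

```python
from collections import defaultdict

def success_rate_by_cache(transfers):
    """Summarize transfer records into per-cache success rates.

    Each record is a dict with a `cache` name and a boolean `ok` flag.
    Returns {cache: (successes, total, rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # cache -> [successes, total]
    for t in transfers:
        counts[t["cache"]][1] += 1
        if t["ok"]:
            counts[t["cache"]][0] += 1
    return {c: (ok, total, ok / total) for c, (ok, total) in counts.items()}
```

The same reduction works per-DAG or per-time-window by filtering `transfers` before the call, which fits the "power user asks a question" model rather than a fixed daily report.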
ACTION: (Cole) Improve DAGMan to track transfer information
ACTION: (Cole) Update htcondor dag status to display transfer status information for nodes/jobs
ACTION: (Cole) Add python API query to get current status directly from DAGMan
ACTION: (Cole) Add -watch or -follow option to htcondor dag status
ACTION: (Cole) Add per-workflow epoch history file and create DAG time analysis of state (this could be done with the nodes.log?)
ACTION: (Cole) Tool to convert a while/until loop statement into a recursive DAG structure
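The stacked-concurrency plot requested above reduces to counting, per sample time, how many jobs are in each state. A sketch, assuming per-job state episodes have already been extracted (e.g., from the nodes.log or an epoch history file); the interval tuple format is an assumption:

```python
from collections import Counter

def concurrency_at(intervals, t):
    """Count jobs per state at time t.

    `intervals` is a list of (start, end, state) tuples, one per
    job-state episode; `end` may be None for a still-open episode.
    Returns a Counter suitable for one x-axis sample of a stacked plot.
    """
    return Counter(
        state
        for start, end, state in intervals
        if start <= t and (end is None or t < end)
    )
```

Evaluating this at a grid of sample times (e.g., every minute over the last 8 hours) yields the pending/running/transferring series directly.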
Top improvements for non-power users
Josh had some ideas on showing whether workflows are “stuck” or not
Mostly: what should a user be able to monitor that shows the workflow is making forward progress? Are enough jobs running (consistent with their priority, any floor/ceiling, and how rare the resources that match their jobs are)? How much time are those jobs spending idle / transferring input / running / transferring output? Presumably this is per “workflow”, where a workflow can be a single DAG, the top-level DAG of a nested DAG, or a single submit file if there is no DAG.
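One cheap notion of “stuck” along the lines Josh raised, as a sketch: flag the workflow if no job event has occurred within a recent window. The event format and the one-hour default are assumptions for illustration:

```python
def looks_stuck(event_times, now, window=3600):
    """Flag a workflow as possibly stuck.

    `event_times` are timestamps (seconds) of recent job events for the
    workflow (state changes, completions). If none fall within `window`
    seconds of `now`, the workflow may be stuck rather than just slow.
    """
    return not any(now - t <= window for t in event_times)
```

A real check would also account for jobs legitimately idle behind a priority floor/ceiling or waiting on rare matching resources, per the criteria above.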
ACTION: (Todd T) have condor_q check its parent PID for the watch command, and warn/recommend condor_watch_q instead
James’s UX Improvements
ACTION: during this f2f, James to check whether Greg’s xfer in/out boolean is good enough for him to know job status = “xfer input files” in elasticsearch (and if not, the action is for Greg to add something that will tell him)
ACTION: Brian to agree to chaos monkey for OS pool CM -or- IGWN will migrate away from Condor CM HA