IGWN dHTC F2F

US/Eastern
103 (Georgia Tech College of Computing Building)

Peter Couvares (LIGO Lab / Caltech)
    • 09:00 12:00
      HTCondor-specific discussion / questions / issues 3h

      Copy-pasted from the Google doc (for this block only) to show what Indico minutes look like; this is not the way to do it!

      Agenda

      Usability / UX Low-Hanging Fruit

      James’s & Cort’s (others too?) opinionated short-list of the most valuable UI/UX improvements needed for DHTC power users (or heavy users)
      data transfer state appearing in condor_q -dag (JIRA ticket)
      Apparently this is shown by condor_watch_q, but not for DAGs
      Cort: “Why I don’t use condor_watch_q? It’s slow to start, and fills my screen with jobs I don’t care about.”
      HTCONDOR-3517
      James: it sucks that there is still no job status code to distinguish running vs. xferring input (but there is for running and xferring output)
      Note there are attributes for TransferringInput and TransferringOutput, and TransferQueued if the transfer is in the transfer queue
      Can htcondor credential list show the Vault token’s age or expiry time? It currently shows when the bearer token was last copied to jobs.
      Do we want to talk about UX for storing and managing Vault credentials?
      Cort is checkpointing between creation of output files. Is this reasonable enough to justify first-class support? A possible interface might be for transfer_checkpoint_files to also accept a list of lists, or to ignore files that don’t yet exist in the list it is given.
      Admin command to raise a user's priority or floor value for N core hours or days? Would make temporary increases easier.
      HTCONDOR-3518
      #170 — users want a condor_rm that preserves+returns interrupted jobs' latest output before exiting
      HTCONDOR-3519
      #183 — HTCondor vault token management on ldas-osg2 seems to be sending wrong file as bearer token
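      A minimal sketch (Python) of how a client-side tool could derive the missing "transferring input" label from the job attributes noted above. The JobStatus numbers are the usual HTCondor codes; the attribute name TransferQueued and the helper describe_job are assumptions for illustration, not an HTCondor API.

```python
# Usual HTCondor JobStatus codes; note there is no code for "transferring input".
JOB_STATUS_NAMES = {
    1: "idle",
    2: "running",
    3: "removed",
    4: "completed",
    5: "held",
    6: "transferring output",
    7: "suspended",
}


def describe_job(ad):
    """Return a display label for a job ad given as a plain dict (hypothetical helper)."""
    status = ad.get("JobStatus")
    # Only trust the transfer attributes while the job is actually active.
    if status in (2, 6):
        queued = bool(ad.get("TransferQueued", False))
        if ad.get("TransferringInput", False):
            return "input transfer queued" if queued else "transferring input"
        if status == 6 or ad.get("TransferringOutput", False):
            return "output transfer queued" if queued else "transferring output"
    return JOB_STATUS_NAMES.get(status, "unknown")
```

      The dicts passed in would come from whatever query the tool already performs (for example, a condor_q projection of these attributes); that plumbing is left out here.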

      Robustness

      What are the obstacles to having glideins use cgroup memory mgmt? For those sites where we control the AP and EP, can we do better?
      #195 — HA collectors & glideinWMS: lousy UX when secondary collector offline, questions whether HA collectors are worth the complexity
      #194 — Condor didn't transfer files when job was held (checkpointing/eviction behavior)
      Where did our idea of having first-class vs. wild-west resources, and auto-routing some users to each (with opt-in to others), stall?

      Capacity / Efficiency / Optimization

      Per user local caching of OSDF file transfers on EPs? [Todd says this is targeted for April]

      Notes

      James’s & Cort’s (others too?) opinionated short-list of the most valuable UI/UX improvements needed for DHTC power users (or heavy users)

      Cort: bias towards giving power users live & historical infrastructure & workflow metadata that they can interrogate/process/visualize, rather than packaging it up for me in a tool summary or batch report – because the infrastructure & usage is evolving/changing fast enough that the problem areas and bottlenecks will change over time, and any tool that chooses what to show me in advance will probably be blind to what I need to see tomorrow

      Top improvements for power users

      live & historical infrastructure & workflow metadata that they can interrogate/process/visualize
      Specifically?
      Provide a tool that takes the transfer data for a time period and prints a summary of the activity for a DAG (e.g., “can you give me success rates broken down by cache?”).
      Prioritize tools that power users run to ask a question, not tools that deliver “here’s what happened yesterday”.
      Interested in the stacked concurrency of jobs broken down by state (e.g., 8 hours ago, how many jobs for this DAG were in pending / running / transferring). X-axis is time.
      Distinguish between “transfer input phase” in general and sub-states (transfer queued, CEDAR running, “OSDF transfer running”).
      ACTION: (Cole) Improve DAGMan to track transfer information
      ACTION: (Cole) Update htcondor dag status to display transfer status information for nodes/jobs
      ACTION: (Cole) Add python API query to get current status directly from DAGMan
      ACTION: (Cole) Add -watch or -follow option to htcondor dag status
      ACTION: (Cole) Add per workflow epoch history file and create DAG time analysis of state (this could be done with the nodes.log?)
      ACTION: (Cole) Tool to convert while/until code statement into recursive DAG structure
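      As a rough illustration of the "stacked concurrency by state" request above, the sketch below counts how many jobs were in each state at a set of sample times. The event format (timestamp, job id, new state), the terminal state name "done", and the function name state_counts are all assumptions for illustration; a real tool would read these transitions from a per-workflow history or event log.

```python
from bisect import bisect_right
from collections import Counter


def state_counts(events, sample_times, terminal_state="done"):
    """Count jobs per state at each sample time.

    events: iterable of (timestamp, job_id, state) state-transition records.
    Returns a list of (sample_time, {state: count}) rows, one per sample time.
    """
    # Group each job's transitions in time order.
    per_job = {}
    for ts, job, state in sorted(events, key=lambda e: e[0]):
        per_job.setdefault(job, []).append((ts, state))

    rows = []
    for t in sample_times:
        counts = Counter()
        for history in per_job.values():
            times = [ts for ts, _ in history]
            i = bisect_right(times, t)  # number of transitions at or before t
            if i:
                state = history[i - 1][1]
                if state != terminal_state:
                    counts[state] += 1
        rows.append((t, dict(counts)))
    return rows


events = [
    (0, "A", "pending"), (0, "B", "pending"),
    (10, "A", "transferring input"), (15, "B", "transferring input"),
    (20, "A", "running"), (40, "A", "done"),
]
for row in state_counts(events, [5, 17, 30, 50]):
    print(row)
```

      Each row can be fed straight into a stacked-area plot with time on the x-axis.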

      Top improvements for non-power users

      Josh had some ideas on showing whether workflows are “stuck” or not
      Mostly: what should a user be able to monitor to see that the workflow is making forward progress? Are enough jobs running (consistent with their priority, any floor/ceiling, and how rare the resources that match their jobs are), and how much time are those jobs spending idle / transferring input / running / transferring output? Presumably this is reported per “workflow”, where a workflow can be a single DAG, the top-level DAG of a nested DAG, or a single submit file if there is no DAG.
      ACTION: (Todd T) condor_q checks parent PID for watch command to warn/recommend condor_watch_q
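      The parent-process check in the action item above could look roughly like the following on Linux. This is only a sketch of the idea in Python, not how condor_q is implemented; the function names, the warning text, and the reliance on /proc/<pid>/comm are assumptions.

```python
import os
from pathlib import Path


def parent_command_name(ppid=None, proc_root="/proc"):
    """Return the parent's command name from <proc_root>/<ppid>/comm, or None if unreadable."""
    ppid = os.getppid() if ppid is None else ppid
    try:
        return Path(proc_root, str(ppid), "comm").read_text().strip()
    except OSError:
        # No procfs (non-Linux) or the parent has already exited.
        return None


def watch_warning(parent_name):
    """Return a hint if the parent looks like watch(1), else None."""
    if parent_name == "watch":
        return ("condor_q appears to be running under watch(1); "
                "consider condor_watch_q instead.")
    return None
```

      Note that watch may run its command through a shell, in which case the immediate parent is the shell rather than watch, so a real check might need to look one level further up.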

      James’s AX Improvements

      ACTION: during this f2f, James to check whether Greg’s xfer in/out boolean is good enough for him to know job status = “xfer input files” in Elasticsearch (and if not, the action is for Greg to add something that will tell him)
      ACTION: Brian to agree to chaos monkey for OS pool CM -or- IGWN will migrate away from Condor CM HA

    • 12:00 13:30
      Lunch 1h 30m
    • 13:30 15:00
      Issues at the boundary of Pelican & Condor 1h 30m
    • 15:00 15:20
      User talks: BayesWave 20m
      Speaker: Shobhit Ranjan
    • 15:20 15:40
      User talks: gstlal / sgnl 20m
      Speaker: Urja Shah
    • 15:40 16:00
      User talks: DetChar on site resources 20m
      Speaker: Evan Goetz
    • 16:00 16:20
      User talks: RIFT 20m
      Speaker: Richard O'Shaughnessy
    • 16:20 17:00
      Discussion 40m
    • 08:00 08:20
      User talks: PyCBC 20m
    • 08:20 08:40
      User talks: CW 20m
      Speaker: Stefano Dal Pra
    • 08:40 12:00
      Discussion 3h 20m
    • 12:00 13:30
      Lunch 1h 30m
    • 13:30 14:00
      Ways in which IGWN is “weird” 30m
    • 14:00 15:00
      OSDF Cache Selection Issues 1h
    • 15:00 16:00
      OSDF / IGWN Stress-Test / Scalability Campaign 1h
    • 16:00 17:00
      How can/should DAGMan better handle workflow requirements (as opposed to job)? 1h
    • 09:00 12:00
      Increasing difficulty matching CPU/highmem and CPU/GPU jobs to resources without starvation, badput, or wasted resources. 3h
    • 12:00 13:30
      Lunch 1h 30m
    • 13:30 17:00
      Spillover / Misc Topics / Hackathons 3h 30m