[ENHANCEMENT] retries on failed rule tasks / catching clusterIO timeout fails as timeouts #852

@barentine

Is your feature request related to a problem? Please describe.
I'm always frustrated when we miss a frame or two of a series in localization analysis because a dataserver bog-down caused a clusterIO timeout. Similarly, I get a little disappointed when a node hits an error on a task that another node wouldn't have failed on (e.g. memory error, spoiled GPU context,

Describe the solution you'd like

  • catch socket timeouts as timeouts and retry them, or more generally:
  • allow retries on task failures, not just timeouts
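
To make the two options above concrete, here is a minimal sketch of the retry behaviour being requested. It is not the existing rule-task code: `run_with_retries`, `max_retries`, `retry_on` and `backoff_s` are hypothetical names made up for illustration. By default only socket timeouts are retried (the first bullet); passing `retry_on=(Exception,)` gives the more general behaviour (the second bullet).

```python
import socket
import time


def run_with_retries(task, max_retries=3, retry_on=(socket.timeout,), backoff_s=0.0):
    """Call task() and retry it when it raises one of the retry_on exception types.

    Hypothetical helper for illustration only.
    Returns (result, attempts). Exceptions not listed in retry_on propagate
    immediately; the last retryable exception is re-raised once max_retries
    retries have been used up.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except retry_on:
            if attempts > max_retries:
                raise
            # Linear backoff, to give a bogged-down server time to recover
            time.sleep(backoff_s * attempts)


# Retry only socket timeouts (first bullet):
#   result, attempts = run_with_retries(some_task)
# Retry any task failure (second bullet):
#   result, attempts = run_with_retries(some_task, retry_on=(Exception,))
```

This retries in the same process, which would probably be enough for a transient clusterIO timeout. For node-specific failures such as a memory error, the retry would presumably need to hand the task back to the scheduler so a different node can pick it up, which this sketch does not attempt.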

Labels: enhancement
