Is your feature request related to a problem? Please describe.
I'm always frustrated when we miss a frame or two of a series in localization analysis because the dataserver bogs down and clusterIO times out. Similarly, I get a little disappointed when a node hits an error on a task that would not have failed on another node (e.g. memory error, spoiled GPU context).
Describe the solution you'd like
- catch socket timeouts as timeouts and retry them, or more generally:
- allow retries on task failures in general, not just on timeouts
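
To make the request concrete, here is a minimal sketch of the kind of retry behaviour described above. It is not existing project code: `run_with_retries`, `max_retries`, `retry_on`, and `delay_s` are hypothetical names chosen for illustration, and the set of exceptions worth retrying is an assumption that would need to be decided per task type.

```python
import socket
import time


def run_with_retries(task, max_retries=3,
                     retry_on=(socket.timeout, MemoryError),
                     delay_s=0.0):
    """Call task() and retry it when it raises a retryable exception.

    All names here are hypothetical. `task` is any zero-argument callable,
    `retry_on` is the tuple of exception types treated as transient, and
    `max_retries` is the number of extra attempts after the first one.
    """
    attempt = 0
    while True:
        try:
            return task()
        except retry_on:
            attempt += 1
            if attempt > max_retries:
                # Out of retries: re-raise so the failure is still reported.
                raise
            # Optional pause, e.g. to let a bogged-down server recover.
            time.sleep(delay_s)
```

In a cluster setting the retry would ideally hand the task back to the queue so that a different node can pick it up, rather than looping on the same node as this sketch does; that part depends on the scheduler and is left out here.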