Bug 397776

Summary:	Aborted job with given startup time not restarted
Product:	[Applications] kstars	Reporter:	Wolfgang Reissenberger <wreissen>
Component:	general	Assignee:	Jasem Mutlaq <mutlaqja>
Status:	RESOLVED WORKSFORME
Severity:	normal	CC:	eric.dejouhanet
Priority:	NOR
Version First Reported In:	2.9.8
Target Milestone:	---
Platform:	unspecified
OS:	Unspecified
Latest Commit:		Version Fixed/Implemented In:
Sentry Crash Report:

Description Wolfgang Reissenberger 2018-08-23 08:14:19 UTC

A SchedulerJob with a given startup time that aborts, for example because guiding temporarily fails, will be set to "invalid" as soon as the scheduler tries to restart it.

There seems to be a misinterpretation of what "start at" means. Currently, it defines a short window in which the job may start: the startup time plus at most 5 minutes.

My expectation would be that the job may be executed at any time from the defined start time onward. It should be set to invalid only once the "repeat until" time has expired.

The problem is that the job starts correctly within this timeframe, but after it has run for a while, some event aborts it. When the scheduler then reaches "evaluateJobs()" in an attempt to restart it, the startup time is already well in the past, which leads to the state "invalid".

Changing the behavior as described would make such jobs much more robust against temporary failures. Currently, when an abort happens after the 5 minutes, the job will never start again.
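
The difference between the two readings of "start at" can be sketched as follows. This is only an illustration: the names `START_GRACE`, `is_startable_current` and `is_startable_expected` are hypothetical and not taken from the KStars source, and the 5-minute window is simply the value mentioned in this report.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: the 5-minute window described in this report.
START_GRACE = timedelta(minutes=5)


def is_startable_current(now: datetime, start_at: datetime) -> bool:
    """Behavior as reported: startable only within start_at .. start_at + 5 min."""
    return start_at <= now <= start_at + START_GRACE


def is_startable_expected(now: datetime, start_at: datetime,
                          repeat_until: Optional[datetime] = None) -> bool:
    """Expected behavior: startable from start_at until an optional end time."""
    if now < start_at:
        return False  # not yet due
    return repeat_until is None or now <= repeat_until


start = datetime(2018, 8, 23, 22, 0)
after_abort = start + timedelta(hours=1)  # aborted and re-evaluated one hour in
until = start + timedelta(hours=4)        # hypothetical "repeat until" time

print(is_startable_current(after_abort, start))          # False: job goes invalid
print(is_startable_expected(after_abort, start, until))  # True: job may restart
```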

Comment 1 TallFurryMan 2018-08-23 10:44:54 UTC

I agree partially. The original use case for START_AT is that the job must start at the given date, not that the job can start from the given date. Therefore when the start date is in the past, the job is made invalid.

I didn't change the implementation of this use case when refactoring the scheduler, because it is valid when the observation has to be made at a certain time and not at another. In my opinion, what you found is that "to be made at a certain time" is incorrectly implemented.

Let's consider the case of a START_AT job with no repeat.

At the very least and generally, I think that if the job was running before a transitory error it should be continued when the error disappears. The Scheduler should not reconsider all the jobs and potentially choose another one. But I would expect this to work currently, as I don't think the Scheduler is changing currentJob until it switches to another job.

Now the real bug might be that if the START_AT job is aborted, re-evaluated, and is still in a relevant timeframe compared to its estimated time (say the job takes 4 hours and is re-evaluated after 1 hour), the Scheduler should not consider that this job is invalid.

If we add the repeat configuration into the equation, it affects the estimated time, so the previous paragraph is still valid. The only edge case is LOOP_INFINITE, where there is no estimated time.
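
A minimal sketch of such a check, assuming a hypothetical helper `still_relevant` that is not part of the Scheduler code. How LOOP_INFINITE should be treated is left open in this comment, so the choice made below is only one possibility.

```python
from datetime import datetime, timedelta
from typing import Optional


def still_relevant(now: datetime, start_at: datetime,
                   estimated: Optional[timedelta]) -> bool:
    """Return True if an aborted START_AT job may still be resumed at `now`.

    `estimated` is the job's estimated duration, repeats included.
    None stands for the LOOP_INFINITE case, which has no estimated time.
    """
    if now < start_at:
        return False  # not yet due
    if estimated is None:
        # Assumption: keep a looping job resumable; the thread does not decide this.
        return True
    return now <= start_at + estimated


start = datetime(2018, 8, 23, 22, 0)
# The example from the comment: a 4-hour job re-evaluated after 1 hour.
print(still_relevant(start + timedelta(hours=1), start, timedelta(hours=4)))  # True
```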

I agree with your suggestion to implement a new behavior for START_AT jobs for multiple reasons, but I would prefer we solve all the issues with the current design first.

Comment 2 Wolfgang Reissenberger 2018-08-23 20:26:24 UTC

OK, I see: it works as designed. I fully agree that we should not add more complexity to the current scheduler.

Meanwhile, I found a scheduler setup that is more robust and leads to the same capturing behavior.

Instead of defining one job with an end time and another one with this time as start time, I use:
- a first job with a defined end time
- a second job with a higher priority value (that is, a lower priority) and ASAP as startup time.

Since the first job has the higher priority, it will be executed until it reaches the end time. The second one won't start until the end time has passed.

Both jobs will restart if they are aborted, so we have a more robust job schedule than before.
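
Why this setup survives aborts can be shown with a simplified model. This is not the scheduler's actual algorithm; `Job` and `pick_job` are hypothetical names, and the sketch assumes that a lower priority value means a higher priority, as the setup above implies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Job:
    name: str
    priority: int                      # assumption: lower value = higher priority
    end_at: Optional[datetime] = None  # None: no end time


def pick_job(jobs: List[Job], now: datetime) -> Optional[Job]:
    """Pick the runnable job with the lowest priority value, or None."""
    runnable = [j for j in jobs if j.end_at is None or now < j.end_at]
    return min(runnable, key=lambda j: j.priority, default=None)


switch = datetime(2018, 8, 24, 1, 0)  # hypothetical end time of the first job
jobs = [Job("first", priority=1, end_at=switch), Job("second", priority=2)]

print(pick_job(jobs, switch - timedelta(hours=1)).name)    # first
print(pick_job(jobs, switch + timedelta(minutes=1)).name)  # second
```

The selection depends only on the current time, not on when a job first started, so re-evaluating after an abort picks the same job again.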

I will set the status to "resolved/worksforme".