It was August 2022, and export requests (asynchronous jobs that assemble, zip, and upload an offline-viewable version of a customer's data) had become a bit of a headache. Most of the time they ran without a hitch, but for some customers with a lot of data the exports would randomly die, bubbling up in our on-call chat room as Resque::DirtyExit exceptions. Some weeks we'd see more than a dozen of these.

Over time, we had applied various band-aids (increasing memory limits, reducing the number of active workers, and so forth) which made things incrementally better, but the problem itself persisted. Each failed export would be retried, sometimes repeatedly, and the worst offenders would have to be run manually in a console session.

Ultimately, the real issue was how long the export took to run. Each time we deployed Basecamp, we would hot-swap our Resque pool, which orphaned any active jobs. We allowed those orphaned jobs to run gracefully to conclusion…but with a caveat: if they took too long, we would kill them. We fiddled a bit with the definition of "too long", but no matter what threshold we set, the long tail ensured that there would always be an account that would run longer.

The solution we hit on was to make these long-running export jobs interruptible:

1. The job is picked up by a worker and execution begins.
2. After some amount of time, the job expires, re-enqueues itself, and exits. (Or, alternatively: a retryable exception is caught, and the job is automatically re-enqueued before exiting.)
3. The job is picked up again by another worker and resumes from where it left off.
4. Repeat steps 2 and 3 until the export finishes.

It's tempting to look at that process description and think of "expiry" as a hard-stop via some kind of asynchronous timer process. In reality, the job polices itself. When the job starts (or restarts), we set an interrupt_at deadline:
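In sketch form, it looks something like this (an ActiveJob-flavored illustration: `interrupt_at`, `ProgressInterruption`, and `perform_later` are the real moving parts from this post, while the `MAX_RUNTIME` value and the `each_remaining_step` helper are stand-ins for details not shown here):

```ruby
class ExportJob < ApplicationJob
  # Raised from inside the job once the interrupt_at deadline has passed.
  class ProgressInterruption < StandardError; end

  # Stand-in threshold; the actual value isn't shown in this excerpt.
  MAX_RUNTIME = 5.minutes

  def perform(export)
    @interrupt_at = MAX_RUNTIME.from_now

    # `each_remaining_step` is a hypothetical helper that yields only the
    # work not yet completed, so a restarted job resumes where the
    # previous run left off.
    export.each_remaining_step do |step|
      raise ProgressInterruption if Time.current >= @interrupt_at
      step.run
    end
  rescue ProgressInterruption
    # Deadline reached: re-enqueue so another worker picks the export
    # back up, then let this invocation exit cleanly.
    self.class.perform_later(export)
  end
end
```

The important design point is that expiry is cooperative: the job checks the clock at safe checkpoints and interrupts itself, rather than being killed from the outside mid-write.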