Abstract

This paper analyzes job scheduling for parallel computers by using theoretical and experimental means. Based on existing architectures, we first present a machine and a job model. Then we propose a simple on-line algorithm employing job preemption without migration and derive theoretical bounds for the performance of the algorithm. The algorithm is experimentally evaluated with trace data from a large computing facility. These experiments show that the algorithm is highly sensitive to parameter selection and that substantial performance improvements over existing non-preemptive scheduling methods are possible.
Introduction
Today's massively parallel computers are built to execute a large number of different and independent jobs with a varying degree of parallelism. For reasons of efficiency, most of these architectures allow space sharing, i.e. the concurrent execution of jobs with little parallelism on disjoint node sets. This produces a complex management task, with the scheduling problem, i.e. the assignment of jobs to nodes and time slots, being a central part. Also, in a typical workload it can neither be assumed that all jobs have the same properties nor that there is a random distribution of jobs with different properties, see e.g. Feitelson and Nitzberg. As most scheduling problems have been shown to be computationally hard, theoretical research has centered around proofs of NP-completeness, approximation algorithms, and heuristic methods to obtain optimal solutions. At the same time, the experimental evaluation of various scheduling heuristics, the study of workload characteristics, and the consideration of architectural constraints have been the focus of applied research in this area. But the interaction between both groups has been rather limited. For instance, so far very few algorithms from the theoretical community have been implemented within real schedulers. Similarly, heuristics used in real parallel systems have rarely been the subject of a theoretical analysis. Various reasons are frequently cited to be responsible for the lack of interaction between both communities.
* Supported by a grant from the NRW Metacomputing project.
† Computer Engineering Institute, University Dortmund, Dortmund, Germany, uwe@ds.e-technik.uni-dortmund.de
‡ Computer Engineering Institute, University Dortmund, Dortmund, Germany, yahya@ds.e-technik.uni-dortmund.de
For instance, designers of commercial schedulers do not care much about approximation factors; for them, deviations of factor … or factor … from an optimal solution will usually both be unacceptable for real workloads. Also, applied researchers often claim that the models and optimality criteria used in theoretical research rarely match the restrictions of many real-life situations. On the other hand, as the worst case behavior of those heuristics typically found in commercial schedulers is usually quite bad with respect to the criteria often used in theoretical research, these algorithms are of little interest to theoreticians. In our paper we want to demonstrate that there are algorithmic issues in job scheduling where theoretical and applied research can both contribute to a solution. To this end we discuss first-come-first-serve (FCFS) scheduling, a simple job scheduling method which can often be found in real systems but is usually considered inadequate by many researchers. First we derive a job scheduling model based on the IBM RS/6000 SP as described by Hotovy and address scheduling objectives. Then we show that for many workloads good performance can be expected from an FCFS schedule. Moreover, bad utilization of the parallel computer can be prevented by introducing gang scheduling into FCFS. We also demonstrate that the term fairness can be transformed into a specific selection of job weights. Using this weight selection we can prove the first constant competitive factor for general on-line weighted completion time scheduling where submission times and job execution times are unknown. Finally, we use simulation experiments with real workloads to determine good and bad strategies for FCFS scheduling with preemption. As the performance of the strategies is highly dependent on the workload, we conclude that an adaptive scheduling strategy may produce the best results in job scheduling for parallel computers.
The Model

Our model is based on the IBM RS/6000 SP parallel computer. We assume a massively parallel computer consisting of R identical nodes. Each node contains one or more processors, main memory, and local hard disks, while there is no shared memory. Equal access from all nodes to mass storage is provided.
Fast communication between the nodes is achieved via a special interconnection network. This network does not prioritize clustering of some subsets of nodes over others, as in a hypercube or a mesh. Therefore, the parallel computer allows free variable partitioning, that is, the resource requirement r_i of a job i can be fulfilled by each node subset of sufficient size. Further, the execution time h_i of job i does not depend on the assigned subset. The parallel computer also supports gang scheduling by switching simultaneously the context of all processors belonging to the same partition. This context switch is executed by use of the local processor memory and/or the local hard disk, while the interconnection network is not affected except for message draining. The context switch also causes a preemption penalty, mainly due to processor synchronization, message draining, saving of job status, and page faults. In the model we describe this preemption penalty by a constant time delay p. During this time delay no node of an affected partition is able to execute any part of a job. Note that the context switch does not include job migration, i.e. a change of the node subset assigned to a job during job execution. Individual nodes are assigned to jobs in an exclusive fashion, i.e. at any time instant any node belongs to at most a single partition. This partition and the corresponding job are said to be active at this time instant. This property assures that the performance of a well balanced parallel job is not degraded by sharing one or a few nodes with another independent job. As gang scheduling is used, a job must therefore be active either on all or on none of the nodes of its partition at the same time. As this simple model does not perfectly match the original architecture, we briefly discuss the deviations of our model from actually implemented IBM RS/6000 SP computers:
There are different types of nodes in IBM SP architectures: thin nodes, wide nodes, and high (SMP) nodes. A wide node usually has some kind of server functionality and contains more memory than a thin node, while there is little difference in processor performance. Recently introduced high nodes contain several processors and cannot form a partition with other nodes at the present time. However, most installations contain predominantly thin nodes.
Nodes need not necessarily be equipped with the same quantity of memory. But in most applications the majority of nodes have the same or a similar amount of memory. While e.g. the Cornell Theory Center (CTC) SP contains nodes with memory ranging from … MB to … MB, more than … of the nodes have either … MB or … MB.
Access to mass storage usually requires the inclusion of an I/O node in the partition, which is typically a wide node.
Interactive jobs running only on a single processor must not be exclusively assigned to a node. However, during operation a parallel computer is usually divided into a batch and an interactive partition. In our paper we focus on the batch partition, as the management of the interactive partition is closely related to the management of workstations.
While most present SP installations do not allow preemption of parallel jobs, IBM has already produced and implemented a prototype for gang scheduling.
Altogether, our model comes reasonably close to real implementations. Finally, note that parallel computers may contain several hundred nodes, as e.g. the batch partition of the CTC SP. Next we describe the job model of the SP. At first the user identifies whether he has a batch or an interactive job. As there typically are separate interactive and batch partitions, we restrict ourselves to batch jobs. Besides requesting special hardware, the user can also specify a minimum number of nodes required and a maximum number of nodes the program is able to use. The scheduler will then assign to the program a number of nodes within this range. During execution of the program, neither the number of nodes nor the assigned subset of nodes will change. Using the terminology of Feitelson and Rudolph, we therefore have a moldable job model and adaptive partitioning. A user may submit his job to one of several batch queues at any time. Each queue is described by the maximum time (wall clock time) a job in this queue is allowed for execution. No other information about the execution time h_i of a job i is provided. When a job exceeds the time limit of the assigned batch queue, it is automatically canceled. For our study we use the same model with two exceptions:
1. No special requests are allowed.
2. The exact number of required processors is given for each job.
Both restrictions are addressed elsewhere in the paper. A similar model to the one described above has also been used by Feldmann et al.
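To summarize the machine and job model in code form, here is a minimal sketch written for this rewrite (it is not part of the paper). All class and field names are illustrative, and the moldable node range is reduced to the fixed requirement used in this study.

    from dataclasses import dataclass, field
    from typing import Optional, Set

    @dataclass
    class Job:
        """A batch job in this model: a fixed node requirement r_i, an execution
        time h_i unknown to the scheduler, a submission time t_i, and a queue
        time limit after which the job is canceled."""
        job_id: int
        r: int                 # exact number of required nodes (fixed, no migration)
        h: float               # execution time h_i
        t: float               # submission time t_i
        time_limit: float      # wall clock limit of the chosen batch queue
        nodes: Optional[Set[int]] = None   # node subset, fixed once assigned
        remaining: float = 0.0

        def __post_init__(self):
            self.remaining = self.h

    @dataclass
    class Machine:
        """R identical nodes, free variable partitioning, exclusive node
        assignment, gang scheduling with a constant preemption penalty p."""
        R: int
        p: float = 0.0
        busy: Set[int] = field(default_factory=set)

        def idle_nodes(self) -> Set[int]:
            return set(range(self.R)) - self.busy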
In many commercial schedulers, fairness is further observed by using the first-come-first-serve principle. But starting jobs in this order will only guarantee fairness in a non-preemptive schedule. In a preemptive schedule, a job may be interrupted by another job which has been submitted later. In the worst case this may result in job starvation, i.e. the delay of job completion for a long period of time. Therefore, we introduce the following parameterized definition of fairness.
Definition. A scheduling strategy is fair if all jobs submitted after a job i cannot increase the flow time of i by more than a given factor.
It is therefore the goal to find a method which produces schedules with small values for the makespan ratio m_S/m_opt, the weighted completion time ratio c_S/c_opt, and the fairness factor.
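As a concrete illustration (not taken from the paper), the schedule-side quantities in these ratios, together with the flow time used in the fairness definition above, can be computed from a finished schedule as follows. The ScheduledJob encoding and the helper names are assumptions made for this sketch; the weight choice w_i = r_i · h_i is the one used later in the paper.

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class ScheduledJob:
        r: int          # resource requirement r_i (nodes)
        h: float        # execution time h_i
        t: float        # submission time t_i
        finish: float   # completion time in the schedule S under evaluation

    def makespan(jobs: List[ScheduledJob]) -> float:
        """m_S: the latest completion time in the schedule."""
        return max(j.finish for j in jobs)

    def weighted_completion_time(jobs: List[ScheduledJob]) -> float:
        """c_S = sum of w_i * C_i with the weight choice w_i = r_i * h_i."""
        return sum(j.r * j.h * j.finish for j in jobs)

    def flow_times(jobs: List[ScheduledJob]) -> List[float]:
        """Flow time of each job: completion time minus submission time."""
        return [j.finish - j.t for j in jobs]

    def fairness_factor(flow_actual: Dict[int, float],
                        flow_undisturbed: Dict[int, float]) -> float:
        """Largest factor by which later submissions inflate the flow time of
        any job (cf. the fairness definition above); 1.0 means no job suffers."""
        return max(flow_actual[i] / flow_undisturbed[i] for i in flow_undisturbed)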
The Algorithm
At first we consider a simple non-preemptive first-fit list scheduling algorithm where the ordering of all jobs is determined by the submission times. This method also allows taking into account special hardware requests or node ranges with an appropriate strategy, as discussed elsewhere in the paper. Although this algorithm is actually used in commercial schedulers, it may produce bad results, as can be seen by the example below.
Example. Simply assume R jobs with h_i = …, r_i = … and R jobs with h_i = …, r_i = R. If jobs are submitted in quick succession from both groups alternately, then m_S exceeds m_opt by a factor that grows with R.
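To make the non-preemptive strategy concrete, here is a minimal sketch (not the paper's implementation; all names are illustrative) of first-fit list scheduling in submission order: each job waits until its requested number of nodes is free, and no later job may overtake it.

    import heapq
    from typing import Dict, List, Tuple

    def fcfs_schedule(jobs: List[Tuple[float, int, float]],
                      R: int) -> Dict[int, Tuple[float, float]]:
        """Non-preemptive first-fit FCFS in submission order.
        jobs: list of (t_i, r_i, h_i), sorted by submission time, with r_i <= R.
        Returns {job index: (start time, completion time)}."""
        free = R
        clock = 0.0
        running: List[Tuple[float, int]] = []        # min-heap of (finish time, nodes)
        schedule: Dict[int, Tuple[float, float]] = {}
        for idx, (t, r, h) in enumerate(jobs):
            clock = max(clock, t)                    # a job cannot start before submission
            while free < r:                          # strict FCFS: wait, never overtake
                finish, nodes = heapq.heappop(running)
                clock = max(clock, finish)
                free += nodes
            schedule[idx] = (clock, clock + h)
            heapq.heappush(running, (clock + h, r))
            free -= r
        return schedule

    # A 4-node job submitted just after a long 1-node job has to wait for it:
    print(fcfs_schedule([(0.0, 1, 10.0), (0.1, 4, 1.0)], R=4))
    # -> {0: (0.0, 10.0), 1: (10.0, 11.0)}

The run at the end mirrors the example above: a single narrow job forces the wide job, and everything behind it, to wait even though most of the machine is idle.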
Even modifications like backfilling cannot guarantee to avoid such a situation. To solve this problem we introduce preemption and accept the drawbacks that come with it. A suitable algorithm for this purpose may be PSRS (Preemptive Smith Ratio Scheduling), which is based on a list and uses gang scheduling. The list order in PSRS is determined by the ratio w_i / (r_i · h_i). As this ratio is the same for all jobs in our scheduling problem (by the choice of job weights w_i = r_i · h_i), any order can be used and PSRS is able to incorporate FCFS. An adaptation of PSRS to our scheduling problem is called PFCFS (Preemptive FCFS) and given in the table below. Intuitively, a schedule produced by Algorithm PFCFS can be described as the interleaving of two non-preemptive FIFO schedules, where one schedule contains at most one wide job at any time instant. Note that only a wide job (r_i > R/2) can cause preemption and therefore increase the completion time of a previously submitted job. Further, all jobs are started in FIFO order. Instruction A is responsible for the on-line character of Algorithm PFCFS. We call any time period a period of available resources (PAR) if during the whole period Instruction A is executed and at least R resources are idle. Note that any execution of Algorithm PFCFS will end with a PAR.
Theoretical Analysis

Before addressing the bounds of schedules produced by Algorithm PFCFS in detail, we describe a few specific properties of schedules where w_i = r_i · h_i holds for all jobs i. Note that these properties can also be easily derived from more general statements of other publications.
Corollary. Assume that w_i = h_i holds for all jobs i of a sequential job system τ. Then any non-preemptive schedule S with no intermediate idle times between time 0 and the last completion time is optimal, and

    c_opt(τ) = c_S(τ) ≤ (max_i{t_i} + Σ_i h_i) · Σ_i h_i.
Proof. The optimality of schedule S for any order of the jobs is a direct consequence of Smith's rule. The bound is clearly true for |τ| = 1. Adding a new job k to job system τ and executing it directly after the other jobs of schedule S produces a schedule S' with

    c_{S'}(τ ∪ {k}) = c_S(τ) + h_k · (max_i{t_i} + Σ_i h_i + h_k)
                    ≤ (max_i{t_i} + Σ_i h_i) · Σ_i h_i + h_k · (max_i{t_i} + Σ_i h_i + h_k)
                    ≤ (max_{i∈τ∪{k}}{t_i} + Σ_{i∈τ∪{k}} h_i) · Σ_{i∈τ∪{k}} h_i.
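The optimality claim (with w_i = h_i every non-idle order has the same cost) can also be checked numerically. The following sketch is purely illustrative, ignores release times, and is not taken from the paper.

    from itertools import permutations

    def weighted_completion(order, h):
        """Sum of w_i * C_i on one node with no idle time and w_i = h_i."""
        total, clock = 0.0, 0.0
        for i in order:
            clock += h[i]          # job i completes at C_i = clock
            total += h[i] * clock  # its weight is w_i = h_i
        return total

    h = [3.0, 1.0, 4.0, 2.0]
    costs = {round(weighted_completion(p, h), 9) for p in permutations(range(len(h)))}
    print(costs)  # a single value: with w_i = h_i every non-idle order is optimal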
Corollary. Assume that w_i = r_i · h_i holds for all jobs i in a job system τ. Replacing a job i in any non-preemptive schedule S by the successive execution of two jobs i1 and i2 with r_{i1} = r_{i2} = r_i, h_{i1} + h_{i2} = h_i, and w_{i1} + w_{i2} = w_i reduces c_S(τ) by h_{i1} · h_{i2} · r_i.
Proof. Splitting job i has no effect on the contribution to c_S(τ) of any other job. It is also independent of the location of job i in the schedule, as the weight of i and the sum of the weights of i1 and i2 are the same. Let t denote the start time of job i in S. While the contribution of job i is r_i · h_i · (t + h_i) in the original schedule, it is r_i · h_{i1} · (t + h_{i1}) + r_i · h_{i2} · (t + h_i) = r_i · h_i · (t + h_i) − r_i · h_{i1} · h_{i2} in the new schedule. The proof still holds in the preemptive case if the second job is not preempted.
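A quick numeric check of this reduction (an illustrative sketch written for this rewrite; the numbers are arbitrary):

    def contribution(r, h, start):
        """w * C with w = r * h and completion time C = start + h."""
        return r * h * (start + h)

    r, h, start = 4, 10.0, 7.0          # an arbitrary job and start time
    h1, h2 = 3.5, 10.0 - 3.5            # split into two pieces run back to back
    before = contribution(r, h, start)
    after = contribution(r, h1, start) + contribution(r, h2, start + h1)
    print(before - after, r * h1 * h2)  # both print 91.0: the reduction is r * h1 * h2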
As already mentioned in the previous section, PFCFS will generate non-preemptive FCFS schedules if all jobs are small, i.e. they require at most half of the maximum amount of resources (r_i ≤ R/2). First we restrict ourselves to this case and prove some bounds. In the lemma below we consider scenarios with a single PAR.
while the parallel computer is active {
    if Q is empty and no new jobs have been submitted
        (A) wait for the next job to be submitted;
    attach all newly submitted jobs to Q in FIFO order;
    pick the first job i of Q and delete it from Q;
    if r_i <= R/2 {
        (B) wait until r_i resources are available;
        start i immediately;
    } else {
        (C) wait until less than r_i resources are used;
        (D) if job i has not been started
            (E1) wait until r_i resources are available or time period h has passed;
        else
            (E2) wait until the previously used subset of r_i resources is available
                 or time period h has passed;
        if the required r_i resources are available
            start or resume execution of i immediately;
        else {
            (F) preempt all currently running jobs;
                start or resume execution of i immediately;
            (G) wait until i has completed or time period h has passed;
            (H) resume the execution of all previously preempted jobs;
            if the execution of i has not been completed
                goto (D);
        }
    }
}
Table: The Scheduling Algorithm PFCFS
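The following is one possible reading of this pseudocode as a toy discrete-time simulation, written for this rewrite rather than taken from the paper. It assumes integer times, ignores the preemption penalty p, skips the initial waiting period of Instruction E, and handles at most one wide job preemptively at a time; the parameter h is the time slice from the pseudocode.

    from collections import deque

    def pfcfs(jobs, R, h):
        """Toy PFCFS: jobs is a list of (t_i, r_i, h_i) with integer values.
        'Wide' means r_i > R/2. Returns {job index: completion time}."""
        queue = deque(sorted(range(len(jobs)), key=lambda i: jobs[i][0]))  # FIFO
        rem = {i: jobs[i][2] for i in range(len(jobs))}
        small_running = {}        # job -> nodes, the non-preemptive FIFO part
        wide = None               # at most one wide job is handled preemptively
        wide_on = False           # True while the wide job holds the whole machine
        slice_left = 0
        done = {}
        t = 0
        while queue or small_running or wide is not None:
            free = R - sum(small_running.values())
            # start jobs from the head of the queue in strict FIFO order
            while queue and jobs[queue[0]][0] <= t and wide is None:
                i = queue[0]
                r = jobs[i][1]
                if r <= free:                      # Instruction B: the job fits
                    small_running[i] = r
                    free -= r
                    queue.popleft()
                elif r > R / 2:                    # Instructions F/G: preempt everything
                    wide = queue.popleft()
                    wide_on, slice_left = True, h
                else:
                    break                          # FIFO: nobody overtakes the head
            if wide_on:                            # the wide job runs alone (gang scheduling)
                rem[wide] -= 1
                if rem[wide] == 0:
                    done[wide] = t + 1
                    wide, wide_on = None, False    # Instruction H: others resume next step
                else:
                    slice_left -= 1
                    if slice_left == 0:            # give the preempted small jobs h units
                        wide_on, slice_left = False, h
            else:                                  # the small jobs run
                for i in list(small_running):
                    rem[i] -= 1
                    if rem[i] == 0:
                        done[i] = t + 1
                        del small_running[i]
                if wide is not None:
                    slice_left -= 1
                    if slice_left == 0:
                        if jobs[wide][1] <= R - sum(small_running.values()):
                            small_running[wide] = jobs[wide][1]  # Instruction E2: fits now
                            wide = None
                        else:
                            wide_on, slice_left = True, h        # Instruction F again
            t += 1
        return done

    # Two 6-unit single-node jobs on a 2-node machine, then a wide 2-node job:
    print(pfcfs([(0, 1, 6), (0, 1, 6), (1, 2, 3)], R=2, h=2))  # -> {2: 6, 0: 9, 1: 9}

In the run above the wide job preempts both small jobs after one time unit, alternates with them in slices of h = 2, and completes at time 6, while the small jobs complete at time 9.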
Lemma. Let r_i ≤ R/2 and w_i = r_i · h_i for all jobs. Also assume that there is no PAR before the submission of the last job. Then Algorithm PFCFS will only produce non-preemptive schedules S with the properties

1. m_S is within a constant factor of m_opt,
2. c_S is within a constant factor of c_opt, and
3. S is fair.
Proof. As Instructions C–H cannot be executed, Algorithm PFCFS only produces non-preemptive FCFS schedules, which are fair. Note that c_opt is lower bounded by the cost of the optimal schedule in which the submission times of all jobs are ignored. We define the time instant x = max{t : for any time instant t' ≤ t there are less than R resources idle}. Note that x ≥ max_i{s_i}. Next we transform the job system τ into a job system τ' by splitting each job i with t_i ≤ x ≤ t_i + h_i into two jobs at time x. According to the corollary above, we have c_{S'}(τ') ≤ c_S(τ) and c_opt(τ') ≤ c_opt(τ). We partition τ' into two disjoint sets τ_1 and τ_2, where τ_2 is the set of all jobs starting at x in S'.

Now we define another time instant

    x̃ = (1/R) · ( Σ_{i∈τ_1} h_i · r_i + Σ_{i∈τ_2} min{x, h_i} · r_i ).

In order to produce a worst case schedule we maximize x̃ by making the (impossible) assumption that exactly … resources are idle in schedule S at any time instant t ≤ x. With the corollary above we then obtain
[Bounds on c_{S'}(τ') and c_opt(τ') in terms of x, x̃, R, and the sums of h_i · r_i over τ_1 and τ_2, with terms min{x, h_i} and max{h_i − x, 0}.]
By use of the definition of x̃ and the relations x̃ ≤ x and min{x, h_i} + max{h_i − x, 0} = h_i for all jobs i, this results in
[…]

…that a wide job i (r_i > R/2) is completed during Instruction G, and examine the time period Δ from the start of Instruction D to the next execution of Instruction A, B, or C. Δ can be split into the following parts:

1. a single part of length h + p,
2. ⌊h_i / h⌋ parts of length h + p, and
3. a single part of length h_i − ⌊h_i / h⌋ · h + p.
Note that each invocation of Instructions E1 and E2 is executed for a time period h during Δ. As h_i ≥ h, a resource–time product of more than a fixed fraction of R · h is used during the first part of Δ. During the second part of Δ it is possible that a single long-running sequential job prevents the availability of the necessary resources to execute the rest of job i in a non-preemptive fashion. Hence we only know that the resource–time product exceeds a smaller fraction of R · h for each part of length h + p. Finally, only a resource–time product proportional to (h_i − ⌊h_i / h⌋ · h) · R is used during the last part of Δ. For the purpose of worst case analysis we can therefore assume that h_i / h ≈ ⌊h_i / h⌋. This means that Δ has a length proportional to h_i, plus preemption overhead depending on p and h, and that a combined resource–time product proportional to h_i · R is used during Δ.
The average usage of resources can be significantly increased if we allow a version of backfilling, that is, the earlier execution of small jobs while a wide job is executed in a preemptive fashion. Provided that a sufficient number of those small jobs is available, the average usage of resources will increase to a level limited only by the preemption penalty p. However, the completion time of a wide job may be delayed by this method, contrary to the original backfilling suggested by Lifka. Next we generalize the previous lemma to the general case.
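A minimal sketch of this backfilling idea (illustrative only; it is neither Lifka's algorithm nor the paper's implementation): while a wide job runs in its preemptive phase, later small jobs that fit on the nodes it leaves idle may be started early.

    from typing import List, Tuple

    def backfill_candidates(waiting: List[Tuple[int, int]], idle: int) -> List[int]:
        """Pick waiting small jobs, in FIFO order, that fit on the `idle` nodes
        a preemptively executed wide job does not use. waiting: (job_id, r_i)."""
        picked = []
        for job_id, r in waiting:
            if r <= idle:
                picked.append(job_id)
                idle -= r
        return picked

    # A wide job uses 6 of 8 nodes; the two 1-node jobs can be backfilled.
    print(backfill_candidates([(4, 3), (5, 1), (6, 1)], idle=8 - 6))  # -> [5, 6]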
Lemma. Let w_i = r_i · h_i for all jobs. Also assume that there is no PAR before the submission of the last job. Then Algorithm PFCFS will only produce schedules S with the properties

1. m_S is within a factor of m_opt that depends on the preemption penalty p,
2. c_S is within a factor of c_opt that depends on p, and
3. S is fair up to a factor that depends on p.
Proof. The completion time of a small job (r_i ≤ R/2) may increase, by an amount proportional to h_i and depending on p, due to a wide job which is submitted later and causes preemption. This yields the fairness result. Based on the proof of the previous lemma, we assume that the execution time of all jobs is small compared to the
makespan of the schedule. Using our experience from the previous lemma, we also consider for the determination of c_S/c_opt only those schedules where the job set τ_2 of that lemma is empty. Hence our schedules consist of periods with wide jobs causing preemption and other periods containing only small jobs. W.l.o.g. we can assume that the average usage of the first kind of periods is given by the bound of the previous lemma, while during the other periods R resources are always used (see the proof of the previous lemma). As the resource usage in the second group of periods is higher and w_i = h_i · r_i holds for all jobs, a worst case schedule is generated by scheduling all those periods after the preemptive periods. Of course, there are always a few small jobs which must also be scheduled in the preemptive periods, but for the purpose of the analysis they are assumed to be scheduled on top of the preemptive jobs. As derived in the proof of the previous lemma, a preemptive job i requires in the worst case a period of length proportional to h_i (plus overhead depending on h and p) if those small jobs in the beginning are not taken into account. Note that the second bound is not tight for h_i ≈ h. By assuming that in schedule S the sum of the execution times of all preemptive jobs is x_1 and all small jobs require a combined resource–time product of x_2 · R, we obtain the following weighted completion time costs:
    c_S(τ) ≤ [an expression in x_1, x_2, R and p],
    c_opt(τ) ≥ [an expression in x_1, x_2 and R],

with separate expressions for the cases x_1 ≥ x_2 and x_1 < x_2. First we consider the case x_1 ≥ x_2. From these inequations we determine a bound for the ratio

    c_S / c_opt ≤ [a function of y and p].

As p may be different for each machine, we generate two separate functions of y, one without and one with the p-dependent term, and maximize both individually. This way we obtain

    c_S / c_opt ≤ … .

In both cases we have y = x_1 / x_2. For x_1 < x_2 we obtain the following smaller bound:

    c_S / c_opt ≤ … .
Next assume that we have a set τ̂ of jobs which are scheduled at or after time x_1 + x_2 in schedule S. Then those jobs are replaced by small jobs such that the resource–time product Σ_{i∈τ̂} h_i · r_i remains invariant and the small jobs are scheduled using R resources at any time instant. This reduces c_opt at least as much as c_S. Therefore the ratio c_S/c_opt cannot be increased by τ̂. Using the same definitions of x_1 and x_2, the following lower bounds hold for the optimal makespan: m_opt ≥ h_i for each job i, m_opt ≥ x_1, and m_opt is also lower bounded in terms of x_1 and x_2 combined. With j being the job that finishes last we obtain

    m_S ≤ h_j + x_1 + x_2 + (preemption overhead proportional to p)
        ≤ … · m_opt.
Finally, we again remove the constraint on the number of PARs.
Theorem. Let w_i = r_i · h_i for all jobs. Then Algorithm PFCFS will only produce schedules S with the properties

1. m_S is within a factor of m_opt that depends on the preemption penalty p,
2. c_S is within a factor of c_opt that depends on p, and
3. S is fair up to a factor that depends on p.
Proof. The proof is done in the same fashion as the proof of the corresponding earlier theorem; there are only some minor differences.

1. If job i with submission time s_i is the last job ending a PAR, then the cut must be positioned within a distance of about p from s_i, which gives enough freedom to choose the cut at a time instant when the execution of small jobs is resumed after a preemption penalty.

2. There is an extension of the splitting corollary to preemptive schedules. However, if a split reduces c_S by a certain amount, then c_opt may only be reduced by a corresponding amount x if the difference between the starting time and the completion time of any job i is bounded by x · h_i.

Taking these changes into account, the techniques of the proof for the earlier theorem also prove the statements of this theorem.
As mentioned before, the bounds for both ratios c_S/c_opt and m_S/m_opt can be reduced if backfilling is used. However, this will result in a more complicated analysis.
Experimental Analysis

In this section we evaluate various forms of preemptive FCFS scheduling with the help of workload data from the CTC SP for the months July to May. These data include all batch jobs which ran on the CTC SP during this time frame. For each job the submission time, the start time, and the completion time were recorded. The submission queue was also provided, as well as requests for special hardware for some jobs. The reasons for the termination of a job (successful completion, failure to complete, user termination, termination due to exceeding the queue time limit) were not given and are not relevant for our simulations. The generation of the trace data had no influence on the execution time of the individual jobs. The CTC uses a batch partition consisting of several hundred nodes, but there are only few jobs which need a very large number of nodes. Taking further into account the execution times of these jobs, preemption will not produce any noticeable gain for them (see the theorem above). Therefore, we assumed for our experiments parallel computers with two different (smaller) numbers of nodes; all jobs requiring more nodes were simply removed. The number of jobs for each month is given in the following table.

[Columns: total number of jobs, and number of jobs within each of the two node limits; rows: Jul–May.]
Table: Number of Jobs
In the experiments, a wide job only preempts enough currently running jobs to generate a sufficiently large partition. Those small jobs are selected by use of a simple greedy strategy. This modification of Algorithm PFCFS does not affect the theoretical analysis, but it improves the schedule performance for real workloads significantly. We generated our own FCFS schedule as a reference schedule and did not use the CTC schedule, as some jobs were, for instance, submitted in October and started in November; using the CTC schedule would not allow evaluating each month separately. As the preemption penalty for the IBM gang scheduler is less than a millisecond, it is neglected (p = 0). We did a large number of simulations with different preemption strategies. For each strategy and each month we determined the makespan, the total weighted […]
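The "preempt only as many jobs as necessary" modification can be sketched as follows; this is an assumed greedy rule written for this rewrite, since the exact selection criterion is not spelled out here.

    from typing import Dict, List

    def jobs_to_preempt(running: Dict[int, int], free: int, demand: int) -> List[int]:
        """Greedily pick running jobs to preempt until a wide job with node
        requirement `demand` fits. running maps job id -> nodes held; jobs
        holding the most nodes are preempted first. Purely illustrative."""
        victims = []
        for job_id, nodes in sorted(running.items(), key=lambda kv: -kv[1]):
            if free >= demand:
                break
            victims.append(job_id)
            free += nodes
        return victims

    # 10 of 64 nodes are free and a 48-node job arrives: preempt the two largest jobs.
    print(jobs_to_preempt({1: 8, 2: 30, 3: 16}, free=10, demand=48))  # -> [2, 3]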
References

W. Smith. Various optimizers for single-stage production. Naval Research Logistics Quarterly.

C. Stein and J. Wein. On the existence of schedules that are near-optimal for both makespan and total weighted completion time. Preprint.

J. J. Turek, W. Ludwig, J. L. Wolf, L. Fleischer, P. Tiwari, J. Glasgow, U. Schwiegelshohn, and P. Yu. Scheduling parallelizable tasks to minimize average response time. In Proceedings of the Annual Symposium on Parallel Algorithms and Architectures, Cape May, NJ.

F. Wang, H. Franke, M. Papaefthymiou, P. Pattnaik, L. Rudolph, and M. S. Squillante. A gang scheduling design for multiprogrammed parallel computing environments. In D. G. Feitelson and L. Rudolph, editors, IPPS Workshop: Job Scheduling Strategies for Parallel Processing, Springer-Verlag, Lecture Notes in Computer Science.
Table: Results for Makespan and … Nodes (columns: three PFCFS variants; rows: Jul–May and Sum).
Table: Results for Completion Time and … Nodes (columns: three PFCFS variants; rows: Jul–May and Sum).
Table: Results for Flow Time and … Nodes (columns: three PFCFS variants; rows: Jul–May and Sum).
Table: Results for Makespan and … Nodes (columns: three PFCFS variants; rows: Jul–May and Sum).
Table: Results for Completion Time and … Nodes (columns: three PFCFS variants; rows: Jul–May and Sum).
Table: Results for Flow Time and … Nodes (columns: three PFCFS variants; rows: Jul–May and Sum).