2021-09-24, 00:03 | #23 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
169D_{16} Posts |
Why don't we compute GCD for a P-1 stage in parallel with the next stage or assignment?
See discussion beginning at https://mersenneforum.org/showpost.p...&postcount=439
Most computations are now done multicore in prime95 / mprime and Mlucas. The P-1 GCD is an exception. Running P-1 stage 1 multicore, then the stage 1 GCD on a single core, then P-1 stage 2 (if stage 1 did not find a factor and available memory is sufficient for stage 2), and then the stage 2 GCD on a single core, leaves most of the cores idle for the duration of each GCD.
Gpuowl does run the GCD in parallel. In a case with multiple Radeon VII GPUs served by a single slow CPU, sequential GCD was taking about 5 minutes of the 40-minute wavefront P-1 to optimal bounds, leaving the GPU idle during the GCD. Running the GCD in parallel while speculatively proceeding with the next stage or assignment added ~14% to throughput. With near-optimal bounds applied, the chance that a following stage, or PRP progress on the same exponent, turns out to be unnecessary is ~2%. If the following work is for a different exponent, there is no potential loss.
The potential gain on CPU applications such as prime95 / mprime or Mlucas seems smaller. Ballpark calculations indicate of order 0.075 to 0.26% of P-1 time, depending on bounds and number of cores per worker. The analysis neglects the initially higher speed a multicore worker may experience when it resumes multicore operation after the package has cooled down during the reduced-core-count serial GCD.
Since, with optimized bounds and limits, TF and P-1 each occupy about 1/40 as long as a primality test, P-1 is about 1 part in 42 of the total (1 + 1 + 40), so the possible gain overall per exponent is diluted by a factor of about 1/42. That gives ~62. ppm of exponent (TF + P-1 + PRP) time in one case (880M on a 16-core Xeon Phi 7210 worker). In the i3-9100 (4 cores, no hyperthreading) 27.4M case, 5.05sec x 2 / 2hr29min x 3cores/4cores = 0.075% of P-1 time; that would correspond to ~.075% x 1/42 = 18. ppm of (TF + P-1 + PRP) time.
Where hyperthreading is available, the full core count might remain available and productive for the parallel speculative execution of the next work while waiting for the GCD to complete.
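A minimal Python sketch of the overlap described above, on a toy number rather than a real Mersenne candidate. The function names (`stage1_residue`, `run_stage2`, `p1_with_background_gcd`) are hypothetical and not taken from prime95, Mlucas, or gpuowl. Threads in standard CPython show only the control flow here, not a real multicore speedup; an actual implementation would run the GCD on a separate core in native code.

```python
from concurrent.futures import ThreadPoolExecutor
from math import gcd


def stage1_residue(base: int, exponent_e: int, n: int) -> int:
    """Toy P-1 stage 1: base^E mod n (stand-in for the multicore or GPU work)."""
    return pow(base, exponent_e, n)


def run_stage2(n: int) -> str:
    """Hypothetical placeholder for the speculative follow-on work."""
    return f"stage 2 progress for {n}"


def p1_with_background_gcd(n: int, base: int, exponent_e: int):
    """Return (factor, speculative_work); one of the two is None."""
    residue = stage1_residue(base, exponent_e, n)
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Start the single-threaded GCD in the background ...
        gcd_future = pool.submit(gcd, residue - 1, n)
        # ... and proceed speculatively with the next work meanwhile.
        speculative = run_stage2(n)
        factor = gcd_future.result()
    if 1 < factor < n:
        # Factor found: the speculative work is discarded (the rare case).
        return factor, None
    # No factor: the speculative work was useful and is kept.
    return None, speculative
```

For 2047 = 23 x 89 with base 3 and E = 22, the background GCD returns 23 and the speculative result is dropped; for the prime 8191 no factor is found and the speculative work is kept.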
Where hyperthreading is not available, it might be necessary to temporarily reduce that worker's core count by one while the GCD runs on the freed core. The above figures include that effect, of regaining n-1 cores' productivity out of the n allocated to a worker. With hyperthreading used for the GCD, the gain may be as high as 66. ppm of exponent time, ~0.28% of P-1 time.
That such maneuvers are not employed in mprime / prime95 or Mlucas may indicate that the authors, if they have evaluated the idea, determined their time is better spent elsewhere or that there are higher priorities. The average of the 66 and 18 ppm possible gains is equivalent to adding 2/3 of a computer to the 15761 seen active on the project in the past 30 days, or to finishing a year's running 22 minutes sooner.
Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-09-24 at 00:05 |
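The ballpark figures in this post can be re-derived with a few lines of arithmetic. This is only a sketch using the numbers quoted in the post as inputs; the constant and function names are illustrative, and a 365-day year is an assumption.

```python
P1_SHARE = 1 / 42             # P-1 share of (TF + P-1 + PRP) time, per the post
ACTIVE_COMPUTERS = 15761      # computers seen active in 30 days, per the post
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year


def p1_fraction_to_ppm(p1_fraction: float) -> float:
    """Dilute a saving expressed as a fraction of P-1 time to ppm of exponent time."""
    return p1_fraction * P1_SHARE * 1e6


# GPU case: 5 idle minutes removed from each 40-minute P-1 run.
gpu_throughput_gain = 40 / (40 - 5) - 1

# CPU cases: quoted savings as a fraction of P-1 time.
low_ppm = p1_fraction_to_ppm(0.075 / 100)   # no hyperthreading, i3-9100 case
high_ppm = p1_fraction_to_ppm(0.28 / 100)   # hyperthreading used for the GCD

# The post averages the rounded figures 66 and 18 ppm.
average_ppm = (66 + 18) / 2
equivalent_computers = average_ppm * 1e-6 * ACTIVE_COMPUTERS
minutes_saved_per_year = average_ppm * 1e-6 * MINUTES_PER_YEAR

print(f"GPU throughput gain: {gpu_throughput_gain:.1%}")
print(f"CPU gain: {low_ppm:.1f} to {high_ppm:.1f} ppm of exponent time")
print(f"Equivalent computers: {equivalent_computers:.2f}")
print(f"Minutes saved per year: {minutes_saved_per_year:.1f}")
```

The results land close to the post's rounded values: about 14%, about 18 and 67 ppm, about 2/3 of a computer, and about 22 minutes per year.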
2021-09-26, 12:55 | #24 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7×827 Posts |
Why don't we preallocate PRP proof disk space in parallel with the computation?
(Some of this was originally posted as parts of https://mersenneforum.org/showpost.p...&postcount=441 and https://mersenneforum.org/showpost.p...&postcount=443)
In recent versions of mprime / prime95, the entire disk space footprint for PRP proof generation temporary residues is preallocated before the PRP computation begins. On a Xeon Phi 7210 this was observed to take about 3 minutes for a 500M PRP, using a single core of an otherwise-stalled 16-core worker. Why not run the preallocation on one core, and the initial PRP iterations on the remaining cores of the worker in parallel? One could compute a time estimate for the space preallocation and a time estimate for when the first proof residue will need to be deposited, parallelize only when there's a comfortable time margin, and also ensure that the first residue save waits for completion of the preallocation.
Preallocating PRP proof power 8 space (15.6 GB) took 3 minutes at 500M on a Xeon Phi 7210 in Windows 10 with a rotating drive. Forecast PRP time was 328.5 days ~ 473040 minutes. 3/473040 x 15cores/16cores = 6. ppm of PRP time saved. This is a microoptimization. Use of hyperthreading may allow a slightly higher gain by using n rather than n-1 worker cores; in that subcase the preallocation runs on a thread using a different logical core (hyperthread).
Proof generation disk space is proportional to exponent and exponential in proof power, so presumably preallocation time is ~linear in exponent, while PRP run time is proportional to ~exponent^{2.1}. So at 110M, preallocation time at proof power 8 is ~110/500 x 3 min = 0.66 minutes; run time is ~(110/500)^{2.1} x 328.5 d = 13.67 days; the ratio is 0.66 min / 13.67 days / (1440 minutes/day) = 34. ppm, substantially more than for larger exponents. If there are no truncation losses, 34 ppm is equivalent to adding 34e-6 x 15700 computers on GIMPS in the past month = 0.53 computers, or to increasing an assumed average clock rate of 3 GHz by 102. kHz.
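The scaling argument above can be written out as a short calculation. This is a sketch under the post's own assumptions (preallocation time linear in exponent, PRP run time proportional to exponent^2.1, reference measurements at 500M on the Xeon Phi 7210); the function and constant names are illustrative only.

```python
REF_EXPONENT = 500e6        # reference exponent from the post
REF_PREALLOC_MIN = 3.0      # observed preallocation time at proof power 8
REF_PRP_DAYS = 328.5        # forecast PRP time at the reference exponent
RUN_TIME_POWER_LAW = 2.1    # assumed run time scaling with exponent
MINUTES_PER_DAY = 1440


def prealloc_minutes(exponent: float) -> float:
    """Preallocation time at proof power 8, assumed linear in exponent."""
    return REF_PREALLOC_MIN * exponent / REF_EXPONENT


def prp_minutes(exponent: float) -> float:
    """PRP run time, assumed proportional to exponent ** 2.1."""
    scale = (exponent / REF_EXPONENT) ** RUN_TIME_POWER_LAW
    return REF_PRP_DAYS * MINUTES_PER_DAY * scale


def saving_ppm(exponent: float, core_factor: float = 1.0) -> float:
    """Preallocation time as ppm of PRP time; core_factor is (n-1)/n if one core is given up."""
    return prealloc_minutes(exponent) / prp_minutes(exponent) * core_factor * 1e6


for exponent in (500e6, 110e6):
    print(
        f"{exponent / 1e6:.0f}M: prealloc {prealloc_minutes(exponent):.2f} min, "
        f"PRP {prp_minutes(exponent) / MINUTES_PER_DAY:.2f} d, "
        f"{saving_ppm(exponent):.1f} ppm"
    )
```

Calling `saving_ppm(500e6, core_factor=15 / 16)` gives the 500M figure with one of 16 cores set aside for the preallocation.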
At (gpuowl maximum) proof power 10 the file would be 4 times larger, so it would presumably take 4 times longer to preallocate, 2.64 minutes; at mprime / prime95's maximum proof power 12 the file would be 4 times larger again, so it would presumably take ~10.6 minutes to preallocate.
A rough estimate of the time from the beginning of preallocation and PRP iterations to the first proof residue save, for prime95's maximum supported proof power 12 (which gives the earliest residue save), at 110M, 4-worker, is 13.67d/2^{12}*1440min/day = 4.8 minutes. So in this case (Xeon Phi 7210, 8 workers or fewer) there is not sufficient time to fully preallocate in parallel at the maximum proof power, against the 10.6 minutes of preallocation time projected. At proof power 10, 110M, the first proof residue would come ~19.2 minutes from the start on the Xeon Phi 7210 4-worker setup, vs. ~2.6 minutes of preallocation time; even a single-worker setup, at 4.8 minutes to the first proof residue, would be OK in parallel. At the default power 8, with 0.66 minutes of preallocation at 110M vs. ~77. minutes to reach the first residue to save, there's ample time for parallel preallocation in the 4-worker case, and also in the 2-worker or 1-worker case, or even up to 16-worker.
In some other processor/drive/exponent/proof-power combinations, there may be a need to cache the first residue in RAM, or to stall the worker when it reaches the first proof generation interim residue until preallocation has completed.
The longer we wait the less worthwhile it becomes: SSDs replacing rotating drives may reduce preallocation time and so diminish the potential time savings. This would need to be a very quick modification to be worth the programming and test time, the risk of new bugs, and the additional complexity. As always, it's the authors' call whether any perceived optimization is worthwhile relative to other opportunities and priorities.
Note, this currently relates only to prime95 / mprime. Gpuowl supports proof powers up to 10 and does not preallocate.
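The timing comparison above amounts to a simple decision rule, sketched below for the 110M Xeon Phi 7210 case. The safety factor is a hypothetical stand-in for a "comfortable time margin", and scaling the per-worker run time linearly with worker count is an assumption that matches the post's 19.2 vs. 4.8 minute figures; none of these names come from prime95.

```python
PRP_DAYS_4_WORKER = 13.67     # 110M run time, 4-worker setup (from the post)
PREALLOC_MIN_POWER_8 = 0.66   # 110M preallocation time at proof power 8
REFERENCE_WORKERS = 4
MINUTES_PER_DAY = 1440
SAFETY_FACTOR = 1.5           # hypothetical margin, not from the post


def prealloc_minutes(power: int) -> float:
    """Preallocation time, assumed to double with each proof power step."""
    return PREALLOC_MIN_POWER_8 * 2 ** (power - 8)


def first_residue_minutes(power: int, workers: int = REFERENCE_WORKERS) -> float:
    """Time from start to the first proof residue save: run time / 2**power."""
    run_minutes = PRP_DAYS_4_WORKER * MINUTES_PER_DAY * workers / REFERENCE_WORKERS
    return run_minutes / 2 ** power


def can_overlap(power: int, workers: int = REFERENCE_WORKERS) -> bool:
    """True if preallocation should finish comfortably before the first save."""
    return SAFETY_FACTOR * prealloc_minutes(power) < first_residue_minutes(power, workers)


for power in (8, 10, 12):
    for workers in (1, 4, 8, 16):
        print(
            f"power {power:2d}, {workers:2d} workers: "
            f"prealloc {prealloc_minutes(power):5.2f} min, "
            f"first residue {first_residue_minutes(power, workers):6.1f} min, "
            f"overlap ok: {can_overlap(power, workers)}"
        )
```

A real implementation would still need the fallback the post mentions, since an estimate can be wrong: hold the first residue in RAM or stall the worker until preallocation has finished.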
Mlucas does not have PRP proof generation implemented and released yet, so its behavior is TBD.
Top of this reference thread: https://www.mersenneforum.org/showth...736#post510736
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Last fiddled with by kriesel on 2021-10-01 at 22:22 Reason: exponential proof file size growth with proof power; 3 example powers |