2021-08-04, 12:22  #12
Aug 2002
2^{2}×3×17×41 Posts 
We played around with 10867_67m1 which is SNFS(270.42) and has a 27M matrix.
025M  38GB  50H
100M  25GB  51H
500M  21GB  59H
The first column is the block size (?) used on the GPU. (25M is the default.) The second column is the memory used on the GPU. The third column is the estimated time in hours for the LA phase.
2021-08-04, 12:29  #13
Aug 2002
20AC_{16} Posts 
If you are using RHEL 8 (8.4) you can install the proprietary Nvidia driver easily via these directions:
https://developer.nvidia.com/blog/st...aritystreams/
Then you will need these packages installed:
gcc
make
cuda-nvcc-10-2
cuda-cudart-dev-10-2-10.2.89-1
And possibly:
gmp-devel
zlib-devel
You also have to manually adjust your path variable in ~/.bashrc:
export PATH="/usr/local/cuda-10.2/bin:$PATH"
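A quick way to confirm the PATH change took effect is to check whether nvcc can be found from a fresh shell. This is only a minimal sketch: the helper name `find_nvcc` and its `extra_dir` parameter are made up for illustration, and `extra_dir` is meant to be the CUDA bin directory from the export line above.

```python
import shutil


def find_nvcc(extra_dir=None):
    """Return the full path to nvcc, or None if it cannot be found.

    Looks on the current PATH first. If that fails and extra_dir is
    given (hypothetical parameter: the CUDA bin directory you added
    in ~/.bashrc), looks there as well.
    """
    found = shutil.which("nvcc")
    if found is None and extra_dir is not None:
        found = shutil.which("nvcc", path=extra_dir)
    return found


if __name__ == "__main__":
    location = find_nvcc()
    if location is None:
        print("nvcc not found on PATH; re-check the export line in ~/.bashrc")
    else:
        print("nvcc found at", location)
```

If nvcc only shows up when you pass the directory explicitly, the export line has not been picked up yet (open a new shell or source ~/.bashrc).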
2021-08-04, 19:32  #14
Aug 2002
2^{2}·3·17·41 Posts 
Quote:
Code:
VBITS = 64; BLOCKS = 25M; MEM = 37.7GB; TIME = 58.8HR
VBITS = 64; BLOCKS = 100M; MEM = 23.8GB; TIME = 66.5HR
VBITS = 64; BLOCKS = 500M; MEM = 20.0GB; TIME = 98.9HR
VBITS = 64; BLOCKS = 1750M; MEM = 19.3GB; TIME = 109.9HR
VBITS = 128; BLOCKS = 25M; MEM = 37.4GB; TIME = 49.5HR
VBITS = 128; BLOCKS = 100M; MEM = 24.2GB; TIME = 50.3HR
VBITS = 128; BLOCKS = 500M; MEM = 20.7GB; TIME = 58.5HR
VBITS = 128; BLOCKS = 1750M; MEM = 20.1GB; TIME = 61.2HR
VBITS = 256; BLOCKS = 25M; MEM = 39.1GB; TIME = 47.4HR
VBITS = 256; BLOCKS = 100M; MEM = 26.5GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 500M; MEM = 23.2GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 1750M; MEM = 22.6GB; TIME = 37.5HR
VBITS = 512; BLOCKS = 25M; MEM = 44.1GB; TIME = 57.1HR
VBITS = 512; BLOCKS = 100M; MEM = 32.2GB; TIME = 43.5HR
VBITS = 512; BLOCKS = 500M; MEM = 28.9GB; TIME = 41.3HR
VBITS = 512; BLOCKS = 1750M; MEM = 28.5GB; TIME = 40.9HR
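For anyone who wants to pick a setting from a table like this for a card with less memory, here is a small sketch that parses the rows and returns the fastest one that fits under a memory cap. The numbers are copied from the table above; the function names and the idea of a `mem_limit_gb` cap are mine, not part of msieve, and the result only applies to this matrix on this card.

```python
import re

# Benchmark rows copied verbatim from the table in the post.
RESULTS = """\
VBITS = 64; BLOCKS = 25M; MEM = 37.7GB; TIME = 58.8HR
VBITS = 64; BLOCKS = 100M; MEM = 23.8GB; TIME = 66.5HR
VBITS = 64; BLOCKS = 500M; MEM = 20.0GB; TIME = 98.9HR
VBITS = 64; BLOCKS = 1750M; MEM = 19.3GB; TIME = 109.9HR
VBITS = 128; BLOCKS = 25M; MEM = 37.4GB; TIME = 49.5HR
VBITS = 128; BLOCKS = 100M; MEM = 24.2GB; TIME = 50.3HR
VBITS = 128; BLOCKS = 500M; MEM = 20.7GB; TIME = 58.5HR
VBITS = 128; BLOCKS = 1750M; MEM = 20.1GB; TIME = 61.2HR
VBITS = 256; BLOCKS = 25M; MEM = 39.1GB; TIME = 47.4HR
VBITS = 256; BLOCKS = 100M; MEM = 26.5GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 500M; MEM = 23.2GB; TIME = 37.2HR
VBITS = 256; BLOCKS = 1750M; MEM = 22.6GB; TIME = 37.5HR
VBITS = 512; BLOCKS = 25M; MEM = 44.1GB; TIME = 57.1HR
VBITS = 512; BLOCKS = 100M; MEM = 32.2GB; TIME = 43.5HR
VBITS = 512; BLOCKS = 500M; MEM = 28.9GB; TIME = 41.3HR
VBITS = 512; BLOCKS = 1750M; MEM = 28.5GB; TIME = 40.9HR
"""

ROW = re.compile(
    r"VBITS = (\d+); BLOCKS = (\d+)M; MEM = ([\d.]+)GB; TIME = ([\d.]+)HR"
)


def parse(text):
    """Turn the table text into a list of dicts, one per benchmark row."""
    return [
        {
            "vbits": int(m.group(1)),
            "blocks_m": int(m.group(2)),
            "mem_gb": float(m.group(3)),
            "hours": float(m.group(4)),
        }
        for m in ROW.finditer(text)
    ]


def fastest(rows, mem_limit_gb=None):
    """Return the row with the lowest estimated time that fits in memory.

    mem_limit_gb is a hypothetical cap (e.g. your card's usable memory);
    None means no cap. Ties go to the first matching row in table order.
    Returns None if nothing fits.
    """
    fits = [r for r in rows if mem_limit_gb is None or r["mem_gb"] <= mem_limit_gb]
    return min(fits, key=lambda r: r["hours"]) if fits else None


if __name__ == "__main__":
    rows = parse(RESULTS)
    print("fastest overall:", fastest(rows))
    print("fastest under 24 GB:", fastest(rows, mem_limit_gb=24))
```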

2021-08-05, 01:04  #15
Jul 2003
So Cal
2·3·7·53 Posts 
That's great! The older V100 definitely doesn't like the VBITS=256 blocks=100M or 500M settings. It doubles the runtime. Anyone using this really needs to test different settings on their card.

2021-08-06, 14:05  #16
Jun 2012
Boulder, CO
2^{4}·3·7 Posts 
Trying this out on an NVIDIA A100. Compiled and starts to run. I'm invoking with:
Code:
./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2
Code:
Fri Aug 6 12:43:59 2021 commencing linear algebra
Fri Aug 6 12:43:59 2021 using VBITS=256
Fri Aug 6 12:44:04 2021 read 36267445 cycles
Fri Aug 6 12:45:36 2021 cycles contain 123033526 unique relations
Fri Aug 6 12:58:53 2021 read 123033526 relations
Fri Aug 6 13:02:37 2021 using 20 quadratic characters above 4294917295
Fri Aug 6 13:14:54 2021 building initial matrix
Fri Aug 6 13:45:04 2021 memory use: 16201.2 MB
Fri Aug 6 13:45:24 2021 read 36267445 cycles
Fri Aug 6 13:45:28 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:45:28 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:50:59 2021 filtering completed in 1 passes
Fri Aug 6 13:51:04 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:51:04 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:35 2021 matrix starts at (0, 0)
Fri Aug 6 13:54:40 2021 matrix is 36267275 x 36267445 (17083.0 MB) with weight 4877632650 (134.49/col)
Fri Aug 6 13:54:40 2021 sparse part has weight 4115543151 (113.48/col)
Fri Aug 6 13:54:40 2021 saving the first 240 matrix rows for later
Fri Aug 6 13:54:47 2021 matrix includes 256 packed rows
Fri Aug 6 13:55:00 2021 matrix is 36267035 x 36267445 (15850.8 MB) with weight 3758763803 (103.64/col)
Fri Aug 6 13:55:00 2021 sparse part has weight 3574908223 (98.57/col)
Fri Aug 6 13:55:01 2021 using GPU 0 (NVIDIA A100-SXM4-40GB)
Fri Aug 6 13:55:01 2021 selected card has CUDA arch 8.0
Code:
25000136 36267035 221384
25000059 36267035 218336
25000041 36267035 214416
25000174 36267035 211066
25000044 36267035 212574
25000047 36267035 212320
25000174 36267035 207956
25000171 36267035 202904
25000117 36267035 197448
25000171 36267035 191566
25000130 36267035 185008
25000136 36267035 178722
24898531 36267035 168358
3811898 36267445 264
22016023 36267445 48
24836805 36267445 60
27790270 36267445 75
24929949 36267445 75
22849647 36267445 75
24896299 36267445 90
22990599 36267445 90
25502972 36267445 110
23602625 36267445 110
26327686 36267445 135
23662886 36267445 135
26145282 36267445 165
23549845 36267445 165
26371744 36267445 205
23884092 36267445 205
26835429 36267445 255
24055165 36267445 255
26699873 36267445 315
23947051 36267445 315
26916570 36267445 390
24419378 36267445 390
27622355 36267445 485
error (line 373): CUDA_ERROR_OUT_OF_MEMORY
2021-08-06, 14:12  #17
Jul 2003
So Cal
2·3·7·53 Posts 
At the end after -nc2, add block_nnz=100000000
Edit: That's a big matrix for that card. If that still doesn't work, try changing 100M to 500M, then 1000M, then 1750M. I think one of those should work. If you still run out of memory with 1750M, then switch to VBITS=128, start over at 100M, and run through them again. That will use less GPU memory for the vectors, saving more for the matrix. Finally, if you still run out of memory with VBITS=128 and block_nnz=1750000000, then use VBITS=512 with -nc2 "block_nnz=1750000000 use_managed=1". That will save the matrix overflow in CPU memory and move it from there as needed. It's slower, but likely still faster than running the CPU version.
Edit 2: I should add that once the matrix is built, you can skip that step with, e.g., -nc2 "skip_matbuild=1 block_nnz=100000000". I haven't tested on an A100, so you may want to benchmark the various settings that work to find the optimal for your card.
Last fiddled with by frmky on 2021-08-06 at 14:55
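The order of attempts described in this post can be written down as a short generator, which is handy if you want to script the retries. This is a sketch of the suggested sequence only: the function name, the `start_vbits` default of 256 (taken from the log in the previous post) and the `skip_matbuild` switch are illustrative. Since VBITS is fixed when msieve is compiled, each change of VBITS means running a differently built binary, and the thread does not say whether a matrix saved under one VBITS build can be reused with skip_matbuild=1 under another.

```python
# block_nnz values to try, in the order given in the post.
BLOCK_NNZ_STEPS = [100_000_000, 500_000_000, 1_000_000_000, 1_750_000_000]


def fallback_settings(start_vbits=256, skip_matbuild=False):
    """Yield (vbits, nc2_args) pairs in the suggested order of attempts.

    vbits tells you which msieve build to use (compile-time setting);
    nc2_args is the quoted string to pass after -nc2.
    skip_matbuild (illustrative switch) prepends skip_matbuild=1, which
    the post says can be used once the matrix has been built.
    """
    prefix = "skip_matbuild=1 " if skip_matbuild else ""

    # First the current build, then the VBITS=128 build, each with
    # progressively larger block_nnz.
    vbits_order = [start_vbits] if start_vbits == 128 else [start_vbits, 128]
    for vbits in vbits_order:
        for nnz in BLOCK_NNZ_STEPS:
            yield vbits, f"{prefix}block_nnz={nnz}"

    # Last resort: managed memory, slower but spills to CPU memory.
    yield 512, f"{prefix}block_nnz=1750000000 use_managed=1"


if __name__ == "__main__":
    for vbits, args in fallback_settings():
        print(f'VBITS={vbits}: -nc2 "{args}"')
```

Stop at the first combination that gets past the matrix copy without CUDA_ERROR_OUT_OF_MEMORY, then benchmark the ones that work.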
2021-08-06, 16:39  #18
Jun 2012
Boulder, CO
2^{4}×3×7 Posts 
Quote:
* all of 100M, 500M, 1000M, 1750M with VBITS=256 all ran out of memory
* managed to get the matrix down to 27M with more sieving. VBITS=256, 1750M still runs out of memory.
* will try VBITS=128 next with the various settings
Is there any work planned to pick optimal (or at least, functional, won't crash) settings automatically?

2021-08-06, 17:52  #19
Jul 2003
So Cal
2·3·7·53 Posts 
Optimal and functional are very different parameters. I can try automatically picking a block_nnz value that is more likely to work, but VBITS is a compile-time setting that can't be changed at runtime. Adding use_managed=1 will make it work in most cases but can significantly slow it down, so I've defaulted it to off.

2021-08-06, 18:03  #20
Jun 2012
Boulder, CO
2^{4}·3·7 Posts 
Looks like it's working now with VBITS=128, and a pretty decent runtime:
Code:
./msieve -v -g 0 -i ./f/input.ini -l ./f/input.log -s ./f/input.dat -nf ./f/input.fb -nc2 block_nnz=1000000000
...
matrix starts at (0, 0)
matrix is 27724170 x 27724341 (13842.2 MB) with weight 3947756174 (142.39/col)
sparse part has weight 3351414840 (120.88/col)
saving the first 112 matrix rows for later
matrix includes 128 packed rows
matrix is 27724058 x 27724341 (12940.4 MB) with weight 3222020630 (116.22/col)
sparse part has weight 3059558876 (110.36/col)
using GPU 0 (NVIDIA A100-SXM4-40GB)
selected card has CUDA arch 8.0
Nonzeros per block: 1000000000
converting matrix to CSR and copying it onto the GPU
1000000043 27724058 8774182
1000000028 27724058 9604503
1000000099 27724058 8923418
59558706 27724058 422238
1082873143 27724341 40960
954052655 27724341 1455100
916348921 27724341 16939530
106284157 27724341 9288468
commencing Lanczos iteration
vector memory use: 2961.3 MB
dense rows memory use: 423.0 MB
sparse matrix memory use: 24188.7 MB
memory use: 27573.0 MB
Allocated 82.0 MB for SpMV library
Allocated 88.6 MB for SpMV library
linear algebra at 0.0%, ETA 20h11m7724341 dimensions (0.0%, ETA 20h11m)
checkpointing every 1230000 dimensions341 dimensions (0.0%, ETA 22h42m)
linear algebra completed 12223 of 27724341 dimensions (0.0%, ETA 20h46m)
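The memory lines in this log can be sanity-checked with a bit of arithmetic, which also shows why dropping VBITS frees room for the matrix. Treat this as a rough sketch under two assumptions of mine that the thread does not confirm: that "MB" in the log means 2^20 bytes, and that the vector memory corresponds to 7 vectors of VBITS bits per matrix column (a count inferred only from this one log, where it happens to match the reported 2961.3 MB). The function names are made up.

```python
MIB = 2 ** 20  # assumption: the log's "MB" is 2**20 bytes

# Figures reported in the log above.
NCOLS = 27_724_341
REPORTED_MB = {"vector": 2961.3, "dense": 423.0, "sparse": 24188.7, "total": 27573.0}


def dense_rows_mib(ncols, packed_rows):
    """Memory for the packed dense rows: packed_rows bits per column."""
    return ncols * packed_rows / 8 / MIB


def vector_mib(ncols, vbits, n_vectors=7):
    """Estimated vector memory: n_vectors vectors of vbits bits per column.

    n_vectors=7 is inferred from this single log, not taken from the
    msieve source, so treat the result as an estimate only.
    """
    return n_vectors * ncols * vbits / 8 / MIB


if __name__ == "__main__":
    parts = REPORTED_MB["vector"] + REPORTED_MB["dense"] + REPORTED_MB["sparse"]
    print(f"sum of reported parts: {parts:.1f} MB (log total: {REPORTED_MB['total']} MB)")
    print(f"dense rows estimate (128 packed rows): {dense_rows_mib(NCOLS, 128):.1f} MB")
    for vbits in (64, 128, 256, 512):
        print(f"vector estimate at VBITS={vbits}: {vector_mib(NCOLS, vbits):.1f} MB")
```

Under these assumptions the vector estimate scales linearly with VBITS, so going from 256 to 128 halves it, which is consistent with the advice that a smaller VBITS leaves more GPU memory for the matrix.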
2021-08-06, 18:03  #21
Jul 2003
So Cal
2×3×7×53 Posts 
The 17.5M matrix for 2,1359+ took just under 15 hours on a V100.

2021-08-06, 18:20  #22
Jul 2003
So Cal
100010110010_{2} Posts 
