Wednesday, December 1, 2010

My Second Super Computer

Cluster GPU Quadruple Extra Large
Memory: 22 GB
EC2 Compute Units: 33.5
GPUs: 2 x NVIDIA Tesla "Fermi" M2050 (448 cores each)
Storage: 1690 GB of local instance storage
Platform: 64-bit, 10 Gigabit Ethernet
OS: CentOS 64-bit

Monte Carlo on One Tesla Device

Options: 256

Simulation paths | CPU time (ms) | CPU options/sec | GPU time (ms) | GPU options/sec
262144           | 6000          | 42              | 3.586         | 71388

Monte Carlo on Two Tesla Devices

Options: 256, split across two Tesla boards

Simulation paths | CPU time (ms) | CPU options/sec | GPU time (ms) | GPU options/sec
262144           | 6000          | 42              | 3.405         | 151999

TOTAL cost: $0.04, including building the environment and sample code from scratch.
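
The options/sec columns are just the option count divided by wall-clock time, so the figures above can be sanity-checked in a couple of lines of Python (times as reported in the tables):

# Sanity check of the options/sec columns: option count / elapsed time.
options = 256

cpu_ms = 6000.0
print(options / (cpu_ms / 1000.0))             # ~42 options/sec on the CPU

gpu_ms_one_board = 3.586                       # single-Tesla time from the table
print(options / (gpu_ms_one_board / 1000.0))   # ~71388 options/sec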

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: "Tesla M2050"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2817982464 bytes
  Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes

Device 1: "Tesla M2050"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major/Minor version number:    2.0
  Total amount of global memory:                 2817982464 bytes
  Multiprocessors x Cores/MP = Cores:            14 (MP) x 32 (Cores/MP) = 448 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.15 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.10, NumDevs = 2, Device = Tesla M2050, Device = Tesla M2050
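
The listing above is the output of NVIDIA's deviceQuery sample. A rough equivalent can also be pulled from Python with PyCUDA; this is only a sketch and assumes PyCUDA is installed, which was not part of the original setup:

# Print deviceQuery-style fields for every CUDA device via PyCUDA.
import pycuda.driver as cuda

cuda.init()
for i in range(cuda.Device.count()):
    dev = cuda.Device(i)
    attr = cuda.device_attribute
    print('Device %d: "%s"' % (i, dev.name()))
    print("  Compute capability: %d.%d" % dev.compute_capability())
    print("  Global memory:      %d bytes" % dev.total_memory())
    print("  Multiprocessors:    %d" % dev.get_attribute(attr.MULTIPROCESSOR_COUNT))
    print("  Warp size:          %d" % dev.get_attribute(attr.WARP_SIZE))
    print("  Max threads/block:  %d" % dev.get_attribute(attr.MAX_THREADS_PER_BLOCK))
    print("  Clock rate:         %.2f GHz" % (dev.get_attribute(attr.CLOCK_RATE) / 1e6))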

Monday, October 11, 2010

ZeroMQ tests

MacPro:~ lydia$ python client.py --messages=1000 --message-size=1000000
Connecting to server...
1000 1000000 297.66666546 MB/s 297.66666546 messages/s

MacPro:~ lydia$ python client.py --messages=10000 --message-size=100000
Connecting to hello world server...
10000 100000 200.893595587 MB/s 2008.93595587 messages/s

MacPro:~ lydia$ python client.py --messages=100000 --message-size=10000
Connecting to hello world server...
100000 10000 54.5556325827 MB/s 5455.56325827 messages/s

MacPro:~ lydia$ python client.py --messages=100000 --message-size=1000
Connecting to hello world server...
100000 1000 8.36256888389 MB/s 8362.56888389 messages/s

MacPro:~ lydia$ python client.py --messages=100000 --message-size=100
Connecting to hello world server...
100000 100 0.866791151229 MB/s 8667.91151229 messages/s

MacPro:~ lydia$ python client.py --messages=100000 --message-size=10
Connecting to hello world server...
100000 10 0.0903534916772 MB/s 9035.34916772 messages/s

MacPro:~ lydia$ python client.py --messages=100000 --message-size=1
Connecting to hello world server...
100000 1 0.00909379595588 MB/s 9093.79595588 messages/s
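
The client.py itself isn't shown; it was a simple pyzmq round-trip throughput test. A minimal sketch of that kind of REQ client follows; the endpoint, port, and the echoing REP server on the other side are assumptions:

# Rough sketch of a ZeroMQ REQ client that reports MB/s and messages/s.
import time
import zmq

def run(messages, message_size, endpoint="tcp://localhost:5555"):
    context = zmq.Context()
    socket = context.socket(zmq.REQ)
    socket.connect(endpoint)
    payload = b"x" * message_size
    print("Connecting to hello world server...")
    start = time.time()
    for _ in range(messages):
        socket.send(payload)   # send the test message
        socket.recv()          # wait for the server's reply
    elapsed = time.time() - start
    mb_per_s = messages * message_size / elapsed / 1e6
    print(messages, message_size, mb_per_s, "MB/s", messages / elapsed, "messages/s")

if __name__ == "__main__":
    run(messages=1000, message_size=1000000)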

Monday, October 4, 2010

File System speeds

MacPro

2 x 3 GHz Quad-Core Intel Xeon
16 GB 667 MHz DDR2
  Drive: ST31500341AS
  Capacity: 1.5 TB (1,500,301,910,016 bytes)
  Model: ST31500341AS                            
  Revision: SD17    
  Serial Number:             9VS0A3HN
  Native Command Queuing: Yes
  Queue Depth: 32
  Removable Media: No
  Detachable Drive: No
  BSD Name: disk0
  Rotational Rate: 7200
  Medium Type: Rotational
  Bay Name: Bay 1
  Partition Map Type: GPT (GUID Partition Table)
  S.M.A.R.T. status: Verified
  Volumes:
  File System: Journaled HFS+
  BSD Name: disk0s2

WRITING 12.4665911198 1024000000 82.1395351915 MB/s

On first run:
READING 1.72446203232 1024000000 593.808376647 MB/s
READING 1.66705989838 1024000000 614.255073256 MB/s
READING 1.66696095467 1024000000 614.291532824 MB/s

Lenovo Think Center

Intel Core i5 650 @ 3.2 GHz (reported as 3.19 GHz)
2 GB RAM
Hitachi HDS721025CLA382
Windows XP (32 bit)

1st Time READING: 58.547000, 2095736020, 35.7957 MB/s
2nd Time READING: 1.516000, 2095736020, 1382.4115 MB/s 
Obviously the buffer cache kicked in.

Now the average for 10 threads: 37.6282 MB/s
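
The WRITING/READING lines look like the output of a simple sequential write-then-read script. Here is a minimal sketch that prints in the same format; the file name and 1 MB chunk size are my assumptions, and repeated reads of a file this size will largely be served from the OS page cache, which is what the second-run numbers show:

# Sequential write/read throughput test, printing lines like the ones above.
import os
import time

PATH = "speedtest.bin"          # hypothetical scratch file
SIZE = 1024000000               # 1,024,000,000 bytes, as in the MacPro runs
CHUNK = b"\0" * (1024 * 1024)   # 1 MB per write

start = time.time()
with open(PATH, "wb") as f:
    for _ in range(SIZE // len(CHUNK)):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())        # push the data to disk, not just the cache
elapsed = time.time() - start
print("WRITING", elapsed, SIZE, SIZE / elapsed / 1e6, "MB/s")

start = time.time()
with open(PATH, "rb") as f:
    while f.read(len(CHUNK)):
        pass
elapsed = time.time() - start
print("READING", elapsed, SIZE, SIZE / elapsed / 1e6, "MB/s")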

Friday, October 1, 2010

My First Super Computer

Macbook Air
1.86 GHz Intel Core 2 Duo
2 GB 1067 MHz DDR3
GeForce 9400M

Total amount of global memory: 265945088 bytes
Number of multiprocessors: 2
Number of cores: 16

Monte Carlo



Options: 256

Simulation paths | CPU time (ms) | CPU options/sec | GPU time (ms) | GPU options/sec
262144           | 8000          | 32.6            | 245.8979      | 1041.08
131072           | 4000          | 64              | 127.68        | 2005
65536            | 2000          | 128             | 63.12         | 4055.57
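
These tables appear to come from the CUDA SDK's Monte Carlo option-pricing sample. For anyone wondering what a "simulation path" is, here is a minimal NumPy sketch of the same idea, pricing one European call under risk-neutral geometric Brownian motion; the spot, strike, rate, and volatility are made-up inputs:

# Minimal Monte Carlo pricer for one European call (risk-neutral GBM).
import numpy as np

def mc_european_call(S0, K, r, sigma, T, paths):
    z = np.random.standard_normal(paths)                  # one normal draw per path
    ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    payoff = np.maximum(ST - K, 0.0)                      # call payoff at expiry
    return np.exp(-r * T) * payoff.mean()                 # discounted average

# One option priced with 262144 simulation paths, as in the first row above.
print(mc_european_call(S0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, paths=262144))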

I was thinking of building a big GPU box. Does anyone have any ideas?


I'm thinking of getting:

EVGA Classified SR-2 (Super Record 2) 270-WS-W555-A1 LGA 1366 Intel 5520 SATA 6Gb/s USB 3.0 HPTX Intel motherboard

Adding 48 GB of RAM and then plugging in 4 GeForce GTX 480s?

Tuesday, September 28, 2010

CouchDB Performance on a MacPro

CouchDB 0.11.0
2 x 3 GHz Quad-Core Intel Xeon
16 GB 667 MHz DDR2
OS X 10.6.4

Inserting

NUM   | BLOCK | time (s) | bytes     | MB/s  | records/s
1     | 1     | 0.0030   | 1         | 0.000 | 330
10    | 1     | 0.0251   | 100       | 0.004 | 398
1000  | 1     | 3.2415   | 1000000   | 0.308 | 308
10000 | 1     | 34.6263  | 100000000 | 2.888 | 289
1000  | 10    | 3.4122   | 10610000  | 3.109 | 2931
100   | 100   | 1.5112   | 10601000  | 7.015 | 6617
10    | 1000  | 1.9464   | 10600100  | 5.446 | 5138
2     | 5000  | 2.3308   | 10600020  | 4.548 | 4290
1     | 10000 | 2.0068   | 10600010  | 5.282 | 4983
10    | 10000 | 15.7176  | 106000100 | 6.744 | 6362
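
The insert script isn't included; the NUM x BLOCK layout suggests NUM requests of BLOCK documents each, which maps naturally onto CouchDB's _bulk_docs endpoint. A rough sketch of such a run follows; the database name, local URL, and ~1 KB payload are assumptions, and urllib is used only to keep the example dependency-free:

# Sketch of a NUM x BLOCK bulk-insert run against CouchDB's _bulk_docs API.
import json
import time
import urllib.request

DB_URL = "http://localhost:5984/bench"   # hypothetical database

def bulk_insert(num, block):
    total_bytes = 0
    start = time.time()
    for _ in range(num):
        docs = [{"payload": "X" * 1024} for _ in range(block)]   # assumed ~1 KB docs
        body = json.dumps({"docs": docs}).encode()
        total_bytes += len(body)
        req = urllib.request.Request(DB_URL + "/_bulk_docs", data=body,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req).read()
    elapsed = time.time() - start
    print(num, block, elapsed, total_bytes,
          total_bytes / elapsed / 1e6, "MB/s",
          num * block / elapsed, "records/s")

bulk_insert(100, 100)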

Average Top:

33423  beam.smp     48.9      05:42.89 13    0    62   151-  39M-   264K  
33421  CouchDBX     12.9      00:52.85 6/1   3    124- 322   77M-   29M   


Monday, September 13, 2010

CouchDB Performance

Using:
Intel Core 2 Duo T9400 @ 2.53 GHz, 2.99 GB RAM, HP EliteBook
Hitachi HTS723216L9A360
CouchDB 1.0, localhost
Simple ASCII 1 KB payload:
{
  "_id": "00081363",
  "_rev": "1-1c29ecbf7bc15e7f9226a45594a0605d",
  "payload": "X" * 1024
}

Inserting

# of writes | Block size | time (s) | bytes     | MB/s  | records/s
1           | 1          | 0.0050   | 1         | 0.000 | 200
10          | 1          | 0.0470   | 100       | 0.002 | 213
1000        | 1          | 4.5253   | 1000000   | 0.221 | 221
10000       | 1          | 48.3623  | 100000000 | 2.068 | 207
1000        | 10         | 4.6867   | 10610000  | 2.264 | 2134
100         | 100        | 2.1353   | 10601000  | 4.965 | 4683
10          | 1000       | 2.4527   | 10600100  | 4.322 | 4077
2           | 5000       | 2.2917   | 10600020  | 4.625 | 4364
1           | 10000      | 3.4477   | 10600010  | 3.075 | 2901


Extracting ALL with a js view

# of reads | time (s) | bytes     | MB/s  | records/s
138485     | 25.7300  | 157733320 | 6.130 | 5382

Extracting with a js view

# of reads | Block size | time (s) | bytes    | MB/s        | records/s
1          | 1          | 0.0000   | 1189     | 1189000.000 | 1000000000
10         | 1          | 0.0470   | 11890    | 0.253       | 213
100        | 1          | 0.4220   | 118990   | 0.282       | 237
1000       | 1          | 4.4990   | 1190890  | 0.265       | 222
1000       | 10         | 6.0930   | 11494000 | 1.886       | 1641
100        | 100        | 2.4210   | 11454400 | 4.731       | 4131
10         | 1000       | 1.9690   | 11450440 | 5.815       | 5079
2          | 5000       | 1.9370   | 11450088 | 5.911       | 5163
1          | 10000      | 1.9370   | 11450044 | 5.911       | 5163
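
The extraction script isn't shown either. One way the blocked reads could have been done is by paging the view with CouchDB's limit and skip query parameters; a sketch, with placeholder design-doc and view names:

# Sketch of paged reads from a CouchDB JS view using limit/skip.
import time
import urllib.request

VIEW_URL = "http://localhost:5984/bench/_design/bench/_view/all"   # placeholder names

def extract(num, block):
    total_bytes = 0
    start = time.time()
    for i in range(num):
        url = "%s?limit=%d&skip=%d" % (VIEW_URL, block, i * block)
        total_bytes += len(urllib.request.urlopen(url).read())   # one block of rows
    elapsed = time.time() - start
    print(num, block, elapsed, total_bytes,
          total_bytes / elapsed / 1e6, "MB/s",
          num * block / elapsed, "records/s")

extract(100, 100)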

Extracting ALL with a Python view

NUM    | time (s) | bytes     | MB/s  | records/s
138489 | 25.6690  | 157733320 | 6.145 | 5395

Extracting with a Python view

NUM  | BLOCK | time (s) | bytes    | MB/s        | records/s
1    | 1     | 0.0000   | 1189     | 1189000.000 | 1000000000
10   | 1     | 0.0470   | 11890    | 0.253       | 213
100  | 1     | 0.4060   | 118990   | 0.293       | 246
1000 | 1     | 4.5310   | 1190890  | 0.263       | 221
1000 | 10    | 5.6090   | 11494000 | 2.049       | 1783
100  | 100   | 2.3280   | 11454400 | 4.920       | 4296
10   | 1000  | 1.9370   | 11450440 | 5.911       | 5163
2    | 5000  | 1.8900   | 11450088 | 6.058       | 5291
1    | 10000 | 1.9220   | 11450044 | 5.957       | 5203
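
The Python views themselves aren't listed. For the couchdb-python query server (couchpy), a map function that simply emits every document would look roughly like this; the actual view behind these numbers is unknown:

# Guess at the shape of the Python view: emit each doc keyed by its _id.
# Assumes CouchDB is configured to use the couchdb-python query server, e.g.
#   [query_servers]
#   python = /usr/local/bin/couchpy
def fun(doc):
    yield doc["_id"], doc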

Extracting ALL

# of reads | time (s) | bytes    | MB/s  | records/s
138485     | 6.4990   | 12463712 | 1.918 | 21309


Extracting
# of reads | Block size | time (s) | bytes    | MB/s  | records/s
1          | 1          | 0.0160   | 1100     | 0.069 | 62
10         | 1          | 0.0470   | 11000    | 0.234 | 213
100        | 1          | 0.4370   | 110000   | 0.252 | 229
1000       | 10         | 9.6080   | 12004000 | 1.249 | 1041
100        | 100        | 6.9990   | 11964400 | 1.709 | 1429
10         | 1000       | 6.4980   | 11960440 | 1.841 | 1539
2          | 5000       | 6.5620   | 11960088 | 1.823 | 1524
1          | 10000      | 6.7330   | 11960044 | 1.776 | 1485