Deep Learning is very computationally intensive, so you will need a fast CPU with many cores, right? Or is it perhaps wasteful to buy a fast CPU? One of the worst things you can do when building a deep learning system is to waste money on hardware that is unnecessary. Here I will guide you step by step through the hardware you will need for a cheap, high-performance system.
In my work on parallelizing deep learning I built a GPU cluster for which I needed to make careful hardware selections. Despite careful research and reasoning I made my fair share of mistakes when I selected the hardware parts which often became clear to me when I used the cluster in practice. Here I want to share what I have learned so you will not step into the same traps as I did.
GPU
This blog post assumes that you will use a GPU for deep learning. If you are building or upgrading your system for deep learning, it is not sensible to leave out the GPU. The GPU is the heart of deep learning applications – the improvement in processing speed is just too huge to ignore.
I talked at length about GPU choice in my previous blog post, and the choice of GPU is probably the most critical choice for your deep learning system. Generally, I recommend a GTX 680 from eBay if you lack money, a GTX Titan X (if you have the money; best for convolution) or a GTX 980 (very cost effective, but a bit limited for very large convolutional nets) as the best current GPUs, and a GTX Titan from eBay if you need cheap memory. I recommended the GTX 580 before, but due to updates to the cuDNN library, which increase the speed of convolution dramatically, all GPUs that do not support cuDNN have become obsolete – and the GTX 580 is such a GPU. If you do not use convolutional nets at all, however, the GTX 580 is still a solid choice.
Can you identify the hardware part which is at fault for bad performance? One of these GPUs? Or maybe it is the fault of the CPU after all?
CPU
To be able to make a wise choice for the CPU we first need to understand the CPU and how it relates to deep learning. What does the CPU do for deep learning? The CPU does little computation when you run your deep nets on a GPU, but your CPU does still work on these things:
- Writing and reading variables in your code
- Executing instructions such as function calls
- Initiating function calls on your GPU
- Creating mini-batches from data
- Initiating transfers to the GPU
Needed number of CPU cores
When I train deep neural nets with three different libraries I always see that one CPU thread is at 100% (and sometimes another thread will fluctuate between 0 and 100% for some time). And this immediately tells you that most deep learning libraries – and in fact most software applications in general – just use a single thread. This means that multi-core CPUs are rather useless. If you run multiple GPUs however and use parallelization frameworks like MPI, then you will run multiple programs at once and you will need multiple threads also. You should be fine with one thread per GPU, but two threads per GPU will result in better performance for most deep learning libraries; these libraries run on one core, but sometimes call functions asynchronously for which a second CPU thread will be utilized. Remember that many CPUs can run multiple threads per core (that is true especially for Intel CPUs), so that one core per GPU will often suffice.
CPU and PCI-Express
It’s a trap! Some new Haswell CPUs do not support the full 40 PCIe lanes that older CPUs support – avoid these CPUs if you want to build a system with multiple GPUs. Also make sure that your processor actually supports PCIe 3.0 if you have a motherboard with PCIe 3.0.
CPU cache size
As we shall see later, CPU cache size is rather irrelevant further along the CPU-GPU pipeline, but I included a short analysis section anyway so that every possible bottleneck along this pipeline is considered and we get a thorough understanding of the overall process.
CPU cache is often ignored when people buy a CPU, but generally it is a very important piece in the overall performance puzzle. The CPU cache is a very small amount of on-chip memory, very close to the CPU, which can be used for high-speed calculations and operations. A CPU often has a hierarchy of caches, which range from small, fast caches (L1, L2) to slow, large caches (L3, L4). As a programmer, you can think of the cache as a hash table, where every entry is a key-value pair and where you can do very fast lookups on a specific key: if the key is found, fast read and write operations can be performed on the value in the cache; if the key is not found (this is called a cache miss), the CPU will need to wait for the RAM to catch up and will then read the value from there – a very slow process. Repeated cache misses result in significant decreases in performance. Efficient CPU caching procedures and architectures are often critical to CPU performance.
How the CPU determines its caching procedure is a very complex topic, but generally one can assume that variables, instructions, and RAM addresses that are used repeatedly will stay in the cache, while less frequent items do not.
In deep learning, the same memory is read repeatedly for every mini-batch before it is sent to the GPU (the memory is just overwritten), but it depends on the mini-batch size whether its memory can be stored in the cache. For a mini-batch size of 128, we have 0.4 MB and 1.5 MB for MNIST and CIFAR, respectively, which will fit into most CPU caches; for ImageNet, we have more than 85 MB for a mini-batch (128×244×244×3 values at 4 bytes each), which is much too large even for the largest cache (L3 caches are limited to a few MB).
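To make these numbers concrete, here is a tiny back-of-the-envelope script; the data dimensions are the usual 28x28 MNIST, 32x32x3 CIFAR and the 244x244x3 ImageNet crops used above, and everything is assumed to be stored as 32-bit floats:

```python
# Rough mini-batch memory footprint in MB, assuming 32-bit floats (4 bytes).
def minibatch_mb(batch_size, *dims, bytes_per_value=4):
    n_values = batch_size
    for d in dims:
        n_values *= d
    return n_values * bytes_per_value / 1024**2

print(minibatch_mb(128, 28, 28))       # MNIST:    ~0.4 MB
print(minibatch_mb(128, 32, 32, 3))    # CIFAR-10: ~1.5 MB
print(minibatch_mb(128, 244, 244, 3))  # ImageNet: ~87 MB
```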
Because data sets in general are too large to fit into the cache, new data need to be read from the RAM for each new mini-batch – so there will be a constant need to access the RAM either way.
RAM memory addresses stay in the cache (the CPU can perform fast lookups in the cache which point to the exact location of the data in RAM), but this is only true if your whole data set fits into your RAM; otherwise the memory addresses will change and there will be no speed-up from caching (one might be able to prevent this by using pinned memory, but as you shall see later, it does not matter anyway).
Other pieces of deep learning code – like variables and function calls – will benefit from the cache, but these are generally few in number and fit easily into the small and fast L1 cache of almost any CPU.
From this reasoning it is sensible to conclude that CPU cache size should not really matter, and further analysis in the next sections is consistent with this conclusion.
Needed CPU clock rate (frequency)
When people think about fast CPUs they usually think first about the clock rate. 4 GHz is better than 3.5 GHz, or is it? This is generally true for comparing processors with the same architecture, e.g. “Ivy Bridge”, but it does not compare well between processors of different architectures. Also, it is not always the best measure of performance.
In the case of deep learning there is very little computation to be done by the CPU: Increase a few variables here, evaluate some Boolean expression there, make some function calls on the GPU or within the program – all these depend on the CPU core clock rate.
While this reasoning seems sensible, there is the fact that the CPU has 100% usage when I run deep learning programs, so what is the issue here? I did some CPU core rate underclocking experiments to find out.
So why is the CPU usage at 100% when the CPU core clock rate is rather irrelevant? The answer might be CPU cache misses: the CPU is constantly busy with accessing the RAM, but at the same time the CPU has to wait for the RAM to catch up with its slower clock rate, and this might result in a paradoxical busy-while-waiting state. If this is true, then underclocking the CPU core would not result in dramatic decreases in performance – which is just what I found in my underclocking experiments.
The CPU also performs other operations, like copying data into mini-batches, and preparing data to be copied to the GPU, but these operations depend on the memory clock rate and not the CPU core clock rate. So now we look at the memory.
Needed RAM clock rate
CPU-RAM and other interactions with the RAM are quite complicated. Here I will show a simplified version of the process. Let's dive in and dissect this process from CPU RAM to GPU RAM for a more thorough understanding.
The CPU memory clock and the RAM are intertwined. The memory clock of your CPU determines the maximum clock rate of your RAM, and both pieces together determine the overall memory bandwidth of your CPU; but usually the RAM itself determines the overall available bandwidth, because it can be slower than the CPU memory rate. You can determine the bandwidth like this:
bandwidth in GB/s = effective RAM clock in GHz × number of memory channels × 64 / 8
where the 64 is for a 64-bit CPU architecture and the division by 8 converts bits into bytes. For my processors and RAM modules the bandwidth is 51.2GB/s.
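As a sanity check, here is the same formula as a minimal script; the DDR3-1600 quad-channel example is just an assumption chosen because it reproduces the 51.2 GB/s figure, so plug in the numbers for your own RAM and CPU:

```python
# Theoretical RAM bandwidth: effective clock (in MT/s) x 64-bit bus width
# / 8 bits per byte x number of memory channels, converted to GB/s.
def ram_bandwidth_gbs(effective_clock_mts, channels, bus_width_bits=64):
    return effective_clock_mts * bus_width_bits / 8 * channels / 1000

# Assumed example: DDR3-1600 in a quad-channel configuration.
print(ram_bandwidth_gbs(1600, 4))  # 51.2 GB/s
```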
However, the bandwidth is only relevant if you copy large amounts of data. Usually the timings – for example 8-8-8 – on your RAM are more relevant for small pieces of data and determine how long your CPU has to wait for your RAM to catch up. But as I outlined above, almost all data from your deep learning program will either easily fit into the CPU cache, or will be much too large to benefit from caching. This implies that timings will be rather unimportant and that bandwidth might be important.
So how does this relate to deep learning programs? I just said that bandwidth might be important, but this is not so when we look at the next step in the process. The memory bandwidth of your RAM determines how fast a mini-batch can be overwritten and allocated for initiating a GPU transfer, but the next step, CPU-RAM to GPU-RAM, is the true bottleneck – this step makes use of direct memory access (DMA). As noted above, the memory bandwidth of my RAM modules is 51.2GB/s, but the DMA bandwidth is only 12GB/s!
The DMA bandwidth relates to the regular bandwidth, but the details are unnecessary here and I will just refer you to this Wikipedia entry, in which you can look up the DMA bandwidth for RAM modules (peak transfer limit). But let's have a look at how DMA works.
Direct memory access (DMA)
The CPU with its RAM can only communicate with a GPU through DMA. In the first step, a specific DMA transfer buffer is reserved in both CPU RAM and GPU RAM; in the second step, the CPU writes the requested data into the CPU-side DMA buffer; in the third step, the reserved buffer is transferred to your GPU RAM without any help from the CPU. Your PCIe bandwidth is 8GB/s (PCIe 2.0) or 15.75GB/s (PCIe 3.0), so you should get RAM with a good peak transfer limit as determined above, right?
Not necessarily. Software plays a big role here. If you do some transfers in a clever way, you will get away with cheaper, slower memory. Here is how.
Asynchronous mini-batch allocation
Once your GPU has finished computation on the current mini-batch, it wants to immediately work on the next mini-batch. You could now, of course, initiate a DMA transfer and then wait for the transfer to complete so that your GPU can continue to crunch numbers. But there is a much more efficient way: prepare the next mini-batch in advance so that your GPU does not have to wait at all. This can be done easily and asynchronously with no degradation in GPU performance.
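To illustrate the idea, here is a minimal sketch of asynchronous mini-batch allocation in Python using a background thread and a small queue; `load_minibatch` and `train_on_gpu` are placeholders for whatever your deep learning library provides:

```python
import queue
import threading

def prefetching_minibatches(load_minibatch, n_batches, buffer_size=2):
    """Yield mini-batches while a background thread prepares the next ones."""
    buf = queue.Queue(maxsize=buffer_size)

    def producer():
        for i in range(n_batches):
            buf.put(load_minibatch(i))  # CPU work: read and prepare batch i
        buf.put(None)                   # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = buf.get()
        if batch is None:
            break
        yield batch  # the GPU can work on this batch while the next is prepared

# Usage with the placeholder functions:
# for batch in prefetching_minibatches(load_minibatch, n_batches=1000):
#     train_on_gpu(batch)
```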
An ImageNet 2012 mini-batch of size 128 for Alex Krizhevsky's convolutional net takes 0.35 seconds for a full backprop pass. Can we allocate the next batch in this time?
If we take the batch size to be 128 and the dimensions of the data to be 244x244x3, that is a total of roughly 0.085 GB (128×244×244×3 values at 4 bytes each). With ultra-slow memory we have 6.4 GB/s, or in other terms 75 mini-batches per second! So with asynchronous mini-batch allocation even the slowest RAM will be more than sufficient for deep learning. There is no advantage in buying faster RAM modules if you use asynchronous mini-batch allocation.
This procedure also implies, indirectly, that the CPU cache is irrelevant. It does not really matter how fast your CPU can overwrite (in the fast cache) and prepare (write the cache to RAM) a mini-batch for a DMA transfer, because the whole transfer will be complete long before your GPU requests the next mini-batch – so a large cache really does not matter much.
So the bottom line is really that the RAM clock rate is irrelevant. Buy what is cheap – end of story.
But how much should you buy?
RAM size
You should have at least the same RAM size as your GPU has. You could work with less RAM, but you might need to transfer data step by step. From my experience however, it is much more comfortable to work with more RAM.
Psychology tells us that concentration is a resource that is depleted over time. RAM is one of the few hardware pieces that allows you to conserve your concentration resource for more difficult programming problems. Rather than spending lots of time circumnavigating RAM bottlenecks, you can invest your concentration in more pressing matters if you have more RAM; you save time and increase your productivity on the problems that count. Especially in Kaggle competitions I found additional RAM very useful for feature engineering. So if you have the money and do a lot of pre-processing, then additional RAM might be a good choice.
Hard drive/SSD
A hard drive can be a significant bottleneck in some cases for deep learning. If your data set is large, you will typically have some of it on your SSD/hard drive, some of it in your RAM, and two mini-batches in your GPU RAM. To feed the GPU constantly, we need to provide new mini-batches at the same rate as the GPU can process them.
For this to work we need to use the same idea as asynchronous mini-batch allocation. We need to read files with multiple mini-batches asynchronously – this is really important! If you do not do this asynchronously you will cripple your performance by quite a bit (5-10%) and render your carefully crafted hardware advantages useless – good deep learning software will run faster on a GTX 680 than bad deep learning software on a GTX 980.
With this in mind, in the case of Alex's ImageNet convolutional net we have 0.085 GB (128×244×244×3 values at 4 bytes each) every 0.3 seconds, or about 290 MB/s, if we save the data as 32-bit floating point data. If, however, we save it as JPEG data, we can compress it 5-15 fold, bringing down the required read bandwidth to about 30 MB/s. If we look at hard drive speeds we typically see speeds of 100-150 MB/s, so this will be sufficient for data compressed as JPEG. Similarly, one can use mp3 or other compression techniques for sound files, but for other data sets that deal with raw 32-bit floating point data it is not possible to compress the data so well: we can compress 32-bit floating point data by only 10-15%. So if you have large 32-bit data sets, hard drives with a speed of 100-150 MB/s will be too slow to keep up with your GPU and you will definitely need an SSD; otherwise a hard drive will be fine.
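The same estimate as a small script, so you can plug in your own batch size, image dimensions, time per batch and compression factor; all numbers below are just the example values from the text:

```python
# Required read bandwidth in MB/s: one mini-batch of 32-bit floats every
# `seconds_per_batch` seconds, divided by a compression factor (e.g. ~10x for JPEG).
def required_read_mbs(batch_size, dims, seconds_per_batch, compression=1.0):
    batch_bytes = batch_size * 4  # 4 bytes per 32-bit value
    for d in dims:
        batch_bytes *= d
    return batch_bytes / 1024**2 / seconds_per_batch / compression

print(required_read_mbs(128, (244, 244, 3), 0.3))                  # ~290 MB/s raw floats
print(required_read_mbs(128, (244, 244, 3), 0.3, compression=10))  # ~29 MB/s as JPEG
```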
Many people buy an SSD for comfort: programs start and respond more quickly, and pre-processing with large files is quite a bit faster, but for deep learning it is only required if your input dimensions are high and you cannot compress your data sufficiently.
If you buy an SSD you should get one which is able to hold data sets of the sizes you typically work with, with an additional few tens of GB of extra space. It is also a good idea to get a hard drive to store your unused data sets on.
Power supply unit (PSU)
Generally, you want a PSU that is sufficient to accommodate all your future GPUs. GPUs typically get more energy efficient over time, so while other components will need to be replaced, a PSU should last a long while; a good PSU is therefore a good investment.
You can calculate the required watts by adding up the wattage of your CPU and GPUs, with an additional 100-300 watts for other components and as a buffer for power spikes.
One important thing to be aware of is whether the PCIe connectors of your PSU are able to supply an 8-pin + 6-pin connector with one cable. I bought one PSU which had 6x PCIe ports, but which was only able to power either an 8-pin or a 6-pin connector per cable, so I could not run 4 GPUs with that PSU.
Another important thing is to buy a PSU with a high power efficiency rating – especially if you run many GPUs and will run them for a long time.
Running a 4 GPU system at full power (1000-1500 watts) to train a convolutional net for two weeks will amount to 300-500 kWh, which in Germany – with rather high power costs of 20 cents per kWh – will amount to 60-100€ ($66-111). If this price is for one hundred percent efficiency, then training such a net with an 80% efficient power supply would increase the cost by an additional 18-26€ – ouch! This is much less for a single GPU, but the point still holds – spending a bit more money on an efficient power supply makes good sense.
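For reference, here is the electricity cost estimate as a short calculation; the 1250 watts, two weeks of training and 20 cents per kWh are just example numbers taken from the ranges above:

```python
# Electricity cost of a training run, taking PSU efficiency into account.
def training_cost_eur(watts, hours, eur_per_kwh, psu_efficiency=1.0):
    kwh_drawn_from_wall = watts * hours / 1000 / psu_efficiency
    return kwh_drawn_from_wall * eur_per_kwh

two_weeks = 14 * 24  # hours
print(training_cost_eur(1250, two_weeks, 0.20))                      # ~84 EUR at 100% efficiency
print(training_cost_eur(1250, two_weeks, 0.20, psu_efficiency=0.8))  # ~105 EUR at 80% efficiency
```

The difference between the two results is the extra roughly 20€ mentioned above.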
Cooling
Cooling is important, and it can be a significant bottleneck which reduces performance more than poor hardware choices do. You should be fine with a standard heat sink for your CPU, but for your GPUs you will need to make special considerations.
Modern GPUs will increase their speed – and thus power consumption – up to their maximum when they run an algorithm, but as soon as the GPU hits a temperature barrier – often 80 °C – the GPU will decrease its speed so that the temperature threshold is not breached. This enables the best performance while keeping your GPU safe from overheating.
However, typical pre-programmed schedules for fan speeds are badly designed for deep learning programs, so this temperature threshold is reached within seconds after starting a deep learning program. The result is decreased performance (a few percent), which can become significant (10-25%) with multiple GPUs, where each GPU heats up the GPUs next to it.
Since NVIDIA GPUs are first and foremost gaming GPUs, they are optimized for Windows. You can change the fan schedule with a few clicks in Windows, but not so in Linux, and as most deep learning libraries are written for Linux this is a problem.
The easiest and most cost efficient workaround is to flash your GPU with a new BIOS which includes a new, more reasonable fan schedule that keeps your GPU cool and the noise levels at an acceptable threshold (if you use a server, you could crank the fan speed to maximum, which is otherwise not really bearable in terms of noise). You can also overclock your GPU memory by a few MHz (30-50), and this is very safe to do. The software for flashing the BIOS is designed for Windows, but you can use wine to call that program from your Linux/Unix OS.
The other option is to set a configuration for your Xorg server (Ubuntu) in which you set the option “coolbits”. This works very well for a single GPU, but if you have multiple GPUs where some of them are headless, i.e. they have no monitor attached to them, you have to emulate a monitor, which is hard and hacky. I tried it for a long time and had frustrating hours with a live boot CD to recover my graphics settings – I could never get it running properly on headless GPUs.
Another, more costly, and craftier option is to use water cooling. For a single GPU, water cooling will nearly halve your temperatures even under maximum load, so that the temperature threshold is never reached. Even multiple GPUs stay cool which is rather impossible when you cool with air. Another advantage of water cooling is that it operates much more silently, which is a big plus if you run multiple GPUs in an area where other people work. Water cooling will cost you about $100 for each GPU and some additional upfront costs (something like $50). Water cooling will also require some additional effort to assemble your computer, but there are many detailed guides on that and it should only require a few more hours of time in total. Maintenance should not be that complicated or effortful.
From my experience these are the most relevant points. I bought large towers for my deep learning cluster, because they have additional fans for the GPU area, but I found this to be largely irrelevant: about a 2-5 °C decrease, not worth the investment and the bulkiness of the cases. The most important part is really the cooling solution directly on your GPU – flash your BIOS, use water cooling, or live with a decrease in performance – these are all reasonable choices in certain situations. Just think about what you want in your situation and you will be fine.
Motherboard and computer case
Your motherboard should have enough PCIe ports to support the number of GPUs you want to run (usually limited to four GPUs, even if you have more PCIe slots); remember that most GPUs have a width of two PCIe slots, so you will need 7 slots to run 4 GPUs, for example. PCIe 2.0 is okay for a single GPU, but PCIe 3.0 is quite cost efficient even for a single GPU; for multiple GPUs, always buy PCIe 3.0 boards, which will be a boon when you do multi-GPU computing, as the PCIe connection will be the bottleneck there.
The motherboard choice is straightforward: Just pick a motherboard that supports the hardware components that you want.
When you select a case, you should make sure that it supports full-length GPUs that sit on top of your motherboard. Most cases support full-length GPUs, but you should be suspicious if you buy a small case. Check its dimensions and specifications; you can also try a Google image search of that model and see if you find pictures with GPUs in them.
Monitors
I first thought it would be silly to write about monitors also, but they make such a huge difference and are so important that I just have to write about them.
The money I spent on my three 27-inch monitors is probably the best money I have ever spent. Productivity goes up by a lot when using multiple monitors. I feel desperately crippled if I have to work with a single monitor. Do not short-change yourself on this matter. What good is a fast deep learning system if you are not able to operate it in an efficient manner?
Some words on building a PC
Many people are scared to build computers. The hardware components are expensive and you do not want to do something wrong. But it is really simple, as components that do not belong together do not fit together. The motherboard manual is often very specific about how to assemble everything, and there are tons of guides and step-by-step videos which walk you through the process if you have no experience.
The great thing about building a computer is that you know everything there is to know about building a computer once you have done it, because all computers are built in the very same way – so building a computer will become a life skill that you will be able to apply again and again. So there is no reason to hold back!
Conclusion / TL;DR
GPU: GTX 680 or GTX 960 (no money); GTX 980 (best performance); GTX Titan (if you need memory); GTX 970 (no convolutional nets)
CPU: Two threads per GPU; full 40 PCIe lanes and correct PCIe spec (same as your motherboard); > 2GHz; cache does not matter;
RAM: Use asynchronous mini-batch allocation; clock rate and timings do not matter; buy at least as much CPU RAM as you have GPU RAM;
Hard drive/SSD: Use asynchronous batch-file reads and compress your data if you have image or sound data; a hard drive will be fine unless you work with 32 bit floating point data sets with large input dimensions
PSU: Add up watts of GPUs + CPU + (100-300) for required power; get high efficiency rating if you use large conv nets; make sure it has enough PCIe connectors (6+8pins) and watts for your (future) GPUs
Cooling: Set coolbits flag in your config if you run a single GPU; otherwise flashing BIOS for increased fan speeds is easiest and cheapest; use water cooling for multiple GPUs and/or when you need to keep down the noise (you work with other people in the same room)
Motherboard: Get PCIe 3.0 and as many slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)
Monitors: If you want to upgrade your system to be more productive, it might make more sense to buy an additional monitor rather than upgrading your GPU
Update 2015-04-22: Removed recommendation for GTX 580
Hi Tim,
This is a great overview. Wondering if you could recommend any cost-effective CPUs with 40 PCIe lanes.
Thanks!
There are many CPUs in all different price ranges which are all reasonable choices, and most CPUs support 40 PCIe lanes. The best practice is probably to look at a site like http://pcpartpicker.com/parts/cpu/ and select a CPU with a good rating and a good price; then check if it supports the 40 lanes and you will be good to go.
Covers everything i wanted to know and even more, thanks!
It also confirms my choice for a pentium g3258 for a single GPU config. Insanely cheap, and even has ecc memory support, something that some folks might want to have..
what,the.heck… Could you have skipped the blather and gotten to the point? There are only a few specific combinations that support what you were trying to explain so maybe something like:
– GTX 580/980
– i5 / i7 CPU
– Lots of ram (duh)
– Fast hard drive
Give a man a fish and you feed him for a day; teach a man to fish and you feed him for a lifetime.
授人以鱼不如授人以渔. same proverb as in Chinese.
Very helpful, thanks for sharing.
I find the recommendation of the GTX 580 for *any* kind of deep learning or budget a little dubious since it doesn’t support cuDNN. What good is a GPU that doesn’t support what’s arguably the most important library for deep learning at the moment?
This is a really good and important point. Let me explain my reasoning why I think a GTX 580 is still good.
The problem with no cuDNN support is really that you will require much more time to set everything up, and often cutting-edge features that are implemented in libraries like torch7 will not be available. But it is not impossible to do deep learning on a GTX 580, and good, usable deep learning software exists. One will probably need to learn CUDA programming to add new features through one's own CUDA kernels, but this will just require time and not money. For some people time and effort are relatively cheap, while money is rather expensive. If you think about students in developing countries this is very much true; if you earn $5500 a year (average GDP per capita at PPP of India; for the US this is $53k – so think about your GPU choice if you had 10 times less money) then you will be happy that there is a deep learning option that costs less than $120. Of course I could recommend cards like the GTX 750, which are also in that price range and which work with cuDNN, but I think a GTX 580 (much faster and more memory) is just better than a GTX 750 (cuDNN support) or other alternatives.
EDIT: I think it might be good to add another option, which offers support for cuDNN but which is rather cheap, like the GTX 960 4GB (only a bit slower than the GTX 580) which will be available shortly for about $250-300. But as you see, an additional $130-180 can be very painful if you are a student in a developing country.
A great 2016 update, if you happen to still frequent this blog (don't see any recent posts), is the new GTX 1060 Pascal graphics card. Specifically the 3GB model. Now 3GB is definitely cutting a tad close on memory, however it's a VASTLY superior choice to both a 580 AND a 960 4GB. The 1060 6GB model is equivalent to a GTX 980 in overall performance, and the 3GB 1060 model is only ever-so-slightly weaker, putting it at the level of a hugely overclocked GTX 970 (I'm talking like ~1,650MHz 970 levels, which is maybe ~5% below a 980).
And the 3GB 1060 can be had for a measly $199 BRAND NEW! It’s definitely something to consider at least. And if you still desperately need that extra VRAM then you can even get the 6GB version of the 1060 (which as i mentioned is literally about tied with an average GTX 980! ) can be had for as little as $249 right now!
I updated my GPU recommendation post with the GTX 1060, but I did not mention the 3GB version, that did not exist at that time. Thanks for letting me know!
Hi,
I want to get a system with GPU for speech processing and deep learning application using python language.
Can you please inbox me the reasonable system hardware requirements?
For these applications a “standard” deep learning system will be sufficient. You can find examples of such systems in the comments section (search for “pcpartpicker” and you will probably find some examples).
Thoughts on the Tesla K40? It’s one of the GPUs available through NVIDIA’s academic hardware grant program: https://developer.nvidia.com/academic_hw_seeding
A K40 will be similar to a GTX Titan in terms of performance. The additional memory will be great if you train large conv nets and this is the main advantage of a K40. If you can choose the upcoming GTX Titan X in the academic grant program, this might be the better choice as it is much faster and will have the same amount of memory.
why is k40 much more expensive when gtx x is cheaper but has more cores and higher bandwidth?
The K40 is a compute card which is used for scientific applications (often system of partial differential equations) which require high precision. Tesla cards have additional double precision and memory correction modules which makes them excel at high precision tasks; these extra features, which are not needed in deep learning, make them so expensive.
ImageNet on K40:
Training is 19.2 secs / 20 iterations (5,120 images) – with cuDNN
and GTX770:
cuDNN Training: 24.3 secs / 20 iterations (5,120 images)
(source: http://caffe.berkeleyvision.org/performance_hardware.html)
I trained ImageNet model on a GTX 960 and have this result:
Training is around 26 secs / 20 iterations (5,120 images) – with cuDNN
So GTX 960 is close to GTX 770
So for 450000 iterations, it takes 120 hours (5 days) on K40, and 162.5 hours (6.77 days) on GTX 960.
Now K40 costs > 3K USD, and GTX 960 costs < 300 USD 🙂
Thanks, this is very useful information!
About (possibly) multiple GPUs, would nvidia SLI be of any significant help?
Thanks for your comment. NVIDIA SLI is an interface which allows you to render computer graphics frames on each GPU and exchange them via SLI. The use of SLI is limited to this application, so doing computations and parallelizing them via SLI is not possible (one needs to use the PCIe interface for this). So CUDA cannot use SLI.
What IDE are you using in that pic? It looks like Eclipse but I can’t quite tell. Great article, a full breakdown is just what I needed!
Glad that you liked the article. I am using Eclipse (NVIDIA Nsight) for C++/C/CUDA in that pic; I also use Eclipse for Python (PyDev) and Lua (Koneki). While I am very satisfied with Eclipse for Python and CUDA, I am less satisfied with Eclipse for Lua (that is torch7) and I probably will switch to Vim for that.
Thanks for this great post!
What's your thought on using g2.xlarge instead of building the hardware? I believe g2.xlarge is a lot slower than a GTX 980. However it is possible to spawn many instances on AWS at the same time, which might be useful for tuning hyperparameters.
Indeed the g2.xlarge is much slower than the GTX 980, but also much cheaper. It is a cheap option if you want to train multiple independent neural nets, but it can be very messy. I only have experience with regular CPU instances, but with those it can take considerable time to manage one's instances, especially if you are using AWS for large data sets together with spot instances – you will definitely be more productive with a local system. But in terms of affordability, GPU instances are just the best.
I just want to make you aware of other downsides of GPU instances, but the overall conclusion stays the same (less productivity, but very cheap): You cannot use multiple GPUs on AWS instances because the interconnect is just too slow and will be a major bottleneck (4 GPUs will run slower than 2). Also the PCIe interconnect performance is crippled by the virtualization. This can be partly improved by a hacky patch, but overall the performance will still be bad (it might well be that 2 GPUs are worse than 1 GPU).
Also like the GTX 580, the GPU instances do not support newer software, and this can be quite bad if you want to run modern variants of convolutional nets.
What motherboards by company and model number do you recommend (ASUS, MSI, etc) for a home PC that will be used for multimedia as well (not concerned with gaming). I am thinking of using a single GTX 980 but may think about add more GPU’s later(not a crucial concern). Also, what i7 cpu models do I need? Thanks for the help and the suggestion of the 960 alternative to the 580. I am learning Torch 7 and can afford the 980.
I only have experience with the motherboards that I use, and one of them has a minor hardware defect, so I do not think my experience is representative of the overall mainboard product; the same goes for other hardware pieces. I think with the directions I gave in this guide you can find your parts on your own through lists that feature user ratings like http://pcpartpicker.com/parts/
Often it is quite practical to sort by rating and buy the first highly rated hardware piece which falls in your budget.
What is the largest dataset you can analyze, you can choose the specs you want, and how much time would it take?
The sky is the limit here. Google ran conv nets that took months to complete and which were run on thousands of computers. For practical data sets, ImageNet is one of the larger data sets and you can expect that new data sets will grow exponentially from there. These data sets will grow as your GPUs get faster, so you can always expect that the state of the art on a large popular data set will take about 2 weeks to train.
Motherboard: Get PCIe 3.0 and as many slots as you need for your (future) GPUs (one GPU takes two slots; max 4 GPUs per system)
just to be sure I get it.
all GPUs are better on a PCIe 3.0 slot, and as each GPU seems to take 2 slots (due to size), for 3 GPUs you'd need a MB with 6 PCIe 3.0 slots?
That’s right, modern GPUs will run faster on a PCIe 3.0 slot.
To install a card you only need a single PCIe 3.0 slot, but because you have a width of two PCIe slots each card will render the PCIe slot next to it unusable. For 3 GPUs you will need 5 PCIe slots, because the first two cover 4 slots and you will need a single fifth slot for the last GPU.
So a motherboard with 5x PCIe 3.0 x16 is fine for 3 GPUs.
when I was mining bitcoins (unfortunately with radeon 🙂 hence why I’m so interested in your article) I used PCI risers like (http://www.amazon.fr/gp/product/B001CC3BNS?psc=1&redirect=true&ref_=oh_aui_search_detailpage)
Do you think those can act as a bottleneck between the PCIe 3.0 slot of the MB and the GPU ?
Using those could prove useful in finding cheaper MB with less PCIe 3.0 slots.
I also read a bit about risers when I was building my GPU cluster, and I often read that there was little to no degradation in performance. However, I do not know what PCIe lane configuration (e.g. 16/8/8/8, or 16/16/8, which are standard for 4 and 3 GPUs, respectively) the motherboard will run under such a setup, and this might be a problem (the motherboard might not support it well). For cryptocurrency mining this is usually not a problem, because you do not have to transfer as much data over the PCIe interface compared to deep learning – so probably no one has ever tested this under deep learning conditions.
So I am not really sure how it will work, but it might be worth a try to test this on one of your old mining motherboards and then buy a motherboard accordingly. If you decide to do so, then please let me know. I would be really interested in what is going on in that case and how well it works. Thanks!
Hi Tim,
I bought an MSI G80 laptop to learn and work on deep learning; it connects 2 GPUs using SLI. Could you please tell me if I could run deep learning on this laptop, even on one GPU.
Regards,
Yes, you will be able to use a single GPU for deep learning; SLI has nothing to do with CUDA. Even for dual-GPU cards (like the GTX 590), on a hardware level you can simply access both GPUs separately. This is also true for software libraries like theano and torch.
Thanks Tim,
Because I don't have a background in coding, I want to use existing libraries. By the way, I bought this laptop not for gaming but for deep learning; I thought it would be more powerful with 2 GPUs, but even if only one works fine that is ok for me. Regards,
You're welcome! If you use Torch7 you will be able to use both GPUs quite easily. If you dread working with Lua (it is quite easy actually, most of the code will be in Torch7, not in Lua), I am also working on my own deep learning library which will be optimized for multiple GPUs, but it will take a few more weeks until it reaches a state which is usable for the public.
Looking at two possible X99 boards, ASUS X99-Deluxe (~$410 US) and ASUS Rampage V Extreme (~$450 US). Unless you know something, I do not see that the extra $40 will make any difference for ML, but maybe it does for other stuff like multimedia or gaming.
Will start with 16G or 32G DDR4 (haven't decided yet, ~$500-$700 US).
I plan to use the 6-core i7-5930k (~$570 US). By your recommendations of 2 cores per GPU that means max 3 GPU’s.
GTX 980’s are ~$500 US and GTX Titans ~$1000 US. Besides loss of PCI slots, extra liquid cooling, what speed difference does one expect in a system with two GTX 980’s versus an identical system with one GTX Titan?
I do not think the boards make a great difference, they are rather about the chipset (x99) than anything else.
I think 6 cores should also be fine for 4 GPUs. On average, the second core is only used sparsely, so that 3 threads can often feed 2 GPUs just fine.
One GTX Titan X will be 150% as fast as a single GTX 980, so two GTX 980 are faster, but because one GPU is much better and easier to use than two, I would go for the GTX Titan X if you can afford it.
“One GTX Titan X will be 150% as fast as a single GTX 980, so two GTX 980 are faster, but because one GPU is much better and easier to use than two, I would go for the GTX Titan X if you can afford it.”
Thanks for advice. Could you elaborate a bit more on the ease of use between one gpu versus two?
Also, i understand the Titan will be replace this year with a faster GTX 980 Ti. They will be the same price.
If you use torch7 then it will be quite straightforward to use 2 GPUs on one problem (2 GPUs yield about 160% of the speed of a single GPU); other libraries do not support multiple GPUs well (theano/pylearn2, caffe), and others are quite complicated to use (cuda-convnet2). So 160% is not much faster than a GTX Titan X, and if you also want to use different libraries, a GTX Titan X would be faster overall (and have more memory too!).
I am just working on a library that combines the ease of use of torch7 with very efficient parallelism (+190% speedup for 2 GPUs), but it will take a month or two until I have implemented all the needed features.
Hi Tim,
I'm interested in the GPU BIOS. Can you share which BIOS with a new, more reasonable fan schedule you are using right now? I have 2 Titan X's waiting to be flashed.
I do not know if a GTX 970 / GTX 980 BIOS is compatible with a GTX Titan X BIOS. Doing a quick Google search, I cannot find information about a GTX Titan X BIOS, which might be because the card is relatively new.
I think you will find the best information in folding@home and other crowd-computing forums (also cryptocurreny mining forums) to get this working.
Thanks for the pointers. fah is very interesting XD, though I don’t find titan x bios yet. Guess I have to live with it for a while.
I saw you have plan to release deep learning library in the future. What framework will you be working on? Torch7, Theano, Caffe?
Great guide Tim, thanks.
I am wondering if you get the display output from the same GPUs which you do the computation on?
I’m gonna buy a 40 lane i7 cpu, which is a LGA 2011 socket, along with a GTX 980. It seems that none of the CPUs with this socket have an internal GPU to drive display. And the other CPUs, LGA 1150 and LGA 1155, do not support more than 28 lanes.
So , the question is do I need a separate GPU to drive displays, or I can do the compute and run the displays on the same GPU?
You can use the same GPU for computation and for display, there will be no problem. The only disadvantage is, that you have a bit less memory. I use 3x 27 inch monitors at 1920×1080 and this config uses about 300-400 MB of memory which I hardly notice (well, I have 6GB of GPU memory). If you are worried about that memory you can get a cheap NVIDIA GT210 (which can hold 2 monitors) for $30 and run your display on that, so that your GTX 980 is completely free for CUDA applications.
I realize this is an old post but what motherboard did you pick? Most LGA2011 seem to not support dual 16x which I thought was the attraction of the 40 pcie lanes.
Got a bit of a compromise i am thinking about. To save on cash in picking a CPU. The i7 5820K and i7 5930K are the same except for pci lanes (28 versus 40). According to this video:
https://youtu.be/rctaLgK5stA
It comes down to using say a 4th 980 or Titan otherwise if it’s three or less then there is no real performance difference. This means a saving on the CPU of about $200.
What’s your thoughts since you warned about the i7 5820 in your article?
Yes, the i7 5820K only has 28 PCIe lanes, and if you buy more than one GPU I would definitely choose a different CPU. The penalty will be observable when you use multiple GPUs, especially if you use 4x GTX 980 (personally, I would choose a cheap CPU < $250 with 40 lanes and instead buy 4x GTX Titan X – that will be sufficient). One note though: remember that in 2016 Q3/Q4 there will be Pascal GPUs, which are about 10 times better than a GTX Titan X (which is 50% better than a GTX 980), so it might be reasonable to go with a cheaper system and go all out once Pascal GPUs are released.
Well if i buy now in terms of the CPU and motherboard then I would like to upgrade this system in a couple years to Pascal. To keep this base system current over a few years then would you still recommend a x99 motherboard? If so then I am stuck with only two choices 5930 or 5960.
AMD has cpu’s and associated motherboards but I am not familiar with anything going that direction. Do they have something in mind here that is cheaper, about the same performance and can handle up to 4 980/titan/pascal GPU’s?
BTW, thought i read somewhere that no current motherboard will handle Pascal, is that correct?
An X99 motherboard might be a bit of an overkill. You will not need most of its features, like DDR4 RAM. As you said, the Pascal GPUs will use their own interconnect, which is much faster than PCIe – this would be another reason to spend less money on the current system. A system based on either the LGA1150 or the LGA2011 socket would be a good choice in terms of performance/cost.
I do not have experience with AMD either, but from the calculations in my blog post I am quite certain that it would also be a reasonable choice. I think in the end it just comes down how much money you have to spare.
Great thanks! Still one thing remain unclear to a newbie builder like me. Is an x99 chip set wed to only motherboards which will not work with Volta/Pascal? If not then I can just swap out the motherboard but keep the x99 compatible CPU, memory, etc.
Also, since you are writing about convolutional nets: these are front-ends that feed neural nets. However, there is a new paper on using an SVM approach that needs less memory, is faster and just as accurate as any state-of-the-art convnet/neural-net combo. It keeps the convolution and pooling layers but replaces the neural net with a new fast-food (LOL) version of SVM. They claim it works “better”.
“Deep Fried Convnets” by Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alex Smola, Le Song, Ziyu Wang.
The SVM versus neural-net battle continues.
Hi Tim, this is a great post!
I’m interested in the actual PCIe bandwidth in the deep learning process. Are PCIe 16 lanes needed for deep learning? Of course x16 PCIe gen3 is ideal for the best performance, but I’m wondering if x8 or x4 PCIe gen3 is also enough performance.
Which do you think better solution if the system has 64 PCIe lanes?
* 4 GPGPUs connected with 16 PCIe lanes each
* 16 GPGPUs connected with 4 PCIe lanes each
Which is important factor, the number of GPGPU (calculation power) or PCIe bandwidth?
Each PCIe lane for PCIe 3.0 has a theoretical bandwidth of about 1 GB/s, so you can run GPUs also with 8 lanes or 4 lanes (8 lanes is standard for at least one GPU if you have more than 2 GPUs), but it will be slower. How much slower will depend on the application or network architecture and which kind of parallelism is used.
64 PCIe lanes are only supported by dual-CPU motherboards, and these boards often have a special PCIe switching architecture which connects the two separate PCIe systems (one for each CPU) with each other; I think you can only run up to 8 GPUs with such a system (the BIOS often cannot handle more GPUs even if you have more PCIe slots). But if you take this as a theoretical example, it is best to just do some test calculations:
16 GPUs means 15 data transfers to synchronize information; 4 PCIe lanes / 15 transfers = 0.2666 GB/s for a full synchronization. If you now have a weight matrix with, say, 800x1200 floating point numbers, you have 800x1200x4 bytes x 1024^-3 ≈ 0.0036 GB. This means you could synchronize 0.2666/0.0036 = 74 gradients per second. A good implementation of MNIST with batch size 128 will run at about 350 batches per second. So the result is that 16 GPUs with 4 PCIe lanes will be about 5 times slower for MNIST. These numbers are better for convolutional nets, but not much better. Same for 4 GPUs/16 lanes:
16/3 = 5.33; 5.33/0.0036 = 647; so in this case there would be a speedup of about 2 times; this is better for convolutional nets (you can expect a speedup of 3.0-3.9 depending on the implementation). You can do similar calculations for model parallelism, in which the 16 GPU case would fare a bit better (but it is probably still slower than 1 GPU).
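For reference, here is the 16-GPU back-of-the-envelope calculation above as a small script, so you can plug in your own lane counts and layer sizes; the ~1 GB/s per PCIe 3.0 lane and the 800x1200 weight matrix of 32-bit floats are the assumptions from this comment:

```python
# Back-of-the-envelope gradient synchronizations per second over PCIe,
# assuming ~1 GB/s per PCIe 3.0 lane and (n_gpus - 1) transfers per sync.
def syncs_per_second(n_gpus, lanes_per_gpu, n_weights, bytes_per_weight=4):
    transfers = n_gpus - 1
    bandwidth_per_sync_gbs = lanes_per_gpu / transfers
    gradient_gb = n_weights * bytes_per_weight / 1024**3
    return bandwidth_per_sync_gbs / gradient_gb

# 16 GPUs with 4 lanes each and one 800x1200 weight matrix:
print(syncs_per_second(16, 4, 800 * 1200))  # ~74 synchronizations per second
```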
So the bottom line is that 16 GPUs with 4 PCIe lanes are quite useless for any sort of parallelism – PCIe transfer rates are very important for multiple GPUs.
Thank you for explanation.
According to your description, it depends on the application, but the data transfer time among GPUs is dominant in a multi-GPU environment.
However, I have another question.
In your assumption, the GPU processing time is always shorter than data transfer time. In 16 GPUs case, GPU processing must take less than 14 msec to process one batch. In 4 GPUs case, it must take less than about 2 msec.
If the GPU processing time is longer enough than data transfer time, the data transfer time for synchronization is negligible. In that case, it is important to have many GPUs rather than PCIe bandwidth.
Is my assumption unlikely in usual case?
This is exactly the case for convolutional nets, where you have high computation with small gradients (weight sharing). However, even for convolutional nets there are limits to this; beyond eight GPUs it can quickly become difficult to gain near-linear speedups, which is mostly due to slow interconnects between computers. An 8 GPU system will be reasonably fast, with speedups of about 7-8 times for convolutional nets, but for more than 8 GPUs you have to use normal interconnects like InfiniBand. InfiniBand is similar to PCIe but its speed is fixed at about 8-25 GB/s (8 GB/s is the affordable standard; 16 GB/s is expensive; 25 GB/s is very, very expensive): So for 6 GPUs + an 8 GB/s standard connection this yields a bandwidth of 1.6 GB/s per full synchronization, which is much worse than the 4 GPU / 16 lanes example; for 12 GPUs this is 0.72 GB/s; 24 GPUs 0.35 GB/s; 48 GPUs 0.17 GB/s. So it pretty quickly becomes slow, even for convolutional nets.
I overlooked your comment, but it is actually a very good question. It turns out that you hit the mark exactly: the less communication is needed, the more it makes sense to favor more GPUs over more bandwidth. However, in deep learning there are only few cases where it makes sense to trade bandwidth for more GPUs. Very deep recurrent neural networks (time dimension) would be an example, and to some degree (very) deep neural networks (20+ layers) are of this type. However, even for 20+ layers you still want to maximize your bandwidth to maximize your overall performance.
For dense neural networks, anything above 4 GPUs is rather impractical. You can make it work to run faster, but this requires much effort and several compromises in model accuracy.
Thanks for a great guide! I’m wondering if you could give me a rough estimate of the performance boost I would get by upgrading my system? Would be awesome to have that before I spend my hard-earned money! I supposed it’s mainly based on my current GPU, but here’s a bit of info about the rest of the system as well.
Current setup:
ATI Radeon™ HD 5770 1gb
One of the last CPU’s from the 775-socket series.
4gb ram
SSD
Upgraded setup:
GTX 960 4gb
Modern dual-thread CPU with 2+ GHz
8gb ram
SSD
Two more questions:
1) I’ve sometimes experienced issues between different motherboard brands and cetain GPU’s. Do you have a recommendation for a specific motherboard brand (or specific product) that would work well with a GTX 960?
2) Any idea of what the performance reduction would be by doing deep learning in caffe using a Virtualbox environment of Ubuntu instead of doing a plain Ubuntu installation?
It is difficult to estimate the performance boost if your previous GPU is an ATI GPU; but for the other hardware pieces you should see about a 5-10% increase in performance.
1. I never had any problems with my motherboards, so I cannot give you any advice here on that topic.
2. I also had this idea once, but it is usually impossible to do this: CUDA and virtualized GPUs do not go together; you will need specialized GPUs (GRID GPUs, which are used on AWS). Even if they did go together, there would be a stark performance decrease.
It is a great change to go from Windows to Ubuntu, but it is really worth doing if you are serious about deep learning. A few months in Ubuntu and you will never want to go back!
Thanks for the quick response! I’ll try Ubuntu then (perhaps some dual-booting). Would it make sense to add water-cooling to a single GTX 960 or would that be overkill?
Tim,
Thanks for a great write-up. Not sure what I’d have done without it.
A bit of a n00b question here,
Do you think it matters in practice if one has PCIe 2.0 or 3.0?
Thanks
If it is possible that you will have a second GPU at anytime in the future definitely get a PCIe 3.0 CPU and motherboard. If you use additional GPUs for parallelism, then in the case of PCIe 2.0 you will suffer a performance loss of about 15% for a second GPU, and much larger losses (+40%) for your third and fourth GPU. If you are sure that you will stay with one GPU in the future, then PCIe 2.0 will only give you a small or no performance decrease (0-5%) and you should be fine.
This may not make much difference if you care about a new system now or about having a more current system in the future. However, if you want to keep it around for years and use it for other things besides ML then wait a few months.
Intel's Skylake CPU will be released in a few months, along with its new chipset, new socket, new motherboards etc. All PCIe 3.0, DDR4, etc. It's considered a big change compared to prior CPUs. Skylake prices are supposed to be similar to current offerings, but retailers say they expect the price of DDR4 to drop. Don't really understand why, but gamers are also waiting for the release … maybe just because "new and improved", since it doesn't seem to translate into a big plus for the gaming experience.
Hi Tim,
Thanks for the insightful posts. I'm a grad student working in the image processing area. I just started to explore some deep learning techniques with my own data. My dataset contains 10 thousand 800*600 images with 50+ classes. I'm wondering whether a GTX 970 will be sufficient to try different networks and algorithms, including CNNs.
Although your data set is very small, so that you will only be able to train a small convolutional net before you overfit, the size of the images is huge. Unfortunately, the size of the images is the most significant memory factor in convolutional nets. I think a GTX 970 will not be sufficient for this.
However, keep in mind that you can always shrink the images to keep them manageable. For a GTX 970 you will need to shrink them to about 250*190 or so.
Thanks for the quick reply. Look forward to your new articles.
Hi Tim,
thank you for your great article. I think it covers everything that you need to know to start your journey with DL.
I'm also a grad student (but instead of image processing, I'm in speech processing) and want to buy a machine (I'm also thinking about Kaggle, but for a beginning I could take a 20-40 place 🙂 ). I want to buy (East Europe) a used workstation (without graphics) + used graphics. Probably I will end up with 2 cards in my computer… Maybe 3….
Questions:
1) You wrote that you need a motherboard with 7 PCIe 3.0 slots for 3 GPUs. Isn't it possible to have a
16 x | 1x | 16x | 1x (etc) setup? Like in http://www.msi.com/product/mb/Z87-G45-GAMING.html#hero-overview?
2) So do there not exist setups that support 16x/16x (or are they too expensive)?
3) I see that compute capability also matters. I can buy a GeForce 780 Ti at a similar price to a GTX 970. The 780 Ti has better bandwidth + more GFLOPS (you never mentioned FLOPS), but the 970 has a newer CC + more memory.
4) Maybe I should let go and buy what… a 960 or 680 (just to start)… However, the 970 is not much more expensive than those 2. Or just buy a whole used PC?
Tim, what do you think?
1. You are right, a 16x | 1x | 16x | 1x setup will work just as well; I had not thought about it in this way, and I will update my blog with that soon – thanks!
2. I hope I understand you right: you have a total of 40x PCIe lanes supported by your CPU (not the physical slots, but sort of the communication wires that are laid from the PCIe slots to the CPU) and your GPUs will use up to 16x of them each (on standard mainboards); so 16x/16x is standard if you use 2 GPUs, for 3 GPUs this is 16x/8x/16x and for 4 GPUs 16x/8x/8x/8x. If you mean physical slots, then a 16x | Yx | 16x setup will do, where Y is any size; because most GPUs have a width of two PCIe slots, you most often cannot run 2 GPUs on adjacent 16x | 16x mainboard slots, though sometimes this will work if you use water cooling (which reduces the width to one slot).
3. GFLOPS do not matter in deep learning (they are virtually the same for all algorithms); your algorithms will always be limited by bandwidth. The 780 Ti has higher bandwidth but an inferior architecture, and the GTX 970 would be faster. However, the GTX 780 Ti has no glitches, and so I would go with the GTX 780 Ti.
4. The GTX 680 might be a bit more interesting than the GTX 780 TI if you really want to train a lot of convolutional nets; otherwise a GTX 780 TI is best; if you only use dense networks you might want to go with the GTX 970
Tim,
Thanks for the excellent guide! It has helped us a lot. However, a few questions remain…
We plan to build a deep-learning machine (in a server rack) based on 4 Titan cards. We need to select other hardware. Ideally we would put all four cards on a single board with 4x PCIe 3.0 x16. The questions are:
1. If I understand correctly, GPU intercommunication is the bottleneck. Should we go for dual 40-lane CPUs (Xeons only, right?), or take a single i7 and connect the cards with SLI?
2. Will any 4x PCIe 3.0 x16 motherboard do? Is socket 2011 preferable?
We plan to use these nets for both convolutional and dense learning. Our budget (everything except the Titans) is around $3000, preferably less, or a bit more if justified. Please advise!
I just read the above post as well and got some needed information, sorry for spamming. From what I understand, SLI is not beneficial.
Should we then go for two weaker Xeons (2620), each with 40 PCIe lanes? Will this be cost-optimal?
Thanks,
F
2 CPUs will typically yield no speedup, because usually the PCIe networks of each CPU (2 GPUs for each CPU) are disconnected, which means that the GPU pairs will communicate through CPU memory (max speed about 4 GB/s, because a GPU pair will share the same connection to the CPU on a PCIe switch). While it is reasonable for 8 GPUs, I would not recommend 2 CPUs for a 4 GPU setup.
There are motherboards that work differently, but these are special solutions which often only come in a package of a whole 8 GPU server rack ($35k-$40k).
If you use a single CPU then any motherboard with enough slots which supports 4 GPUs will do; choose the CPU so that it supports 40 PCIe lanes and you will be ready to go. Socket 2011 has no advantage over other sockets which fulfill these requirements.
Regarding SLI: SLI can be used for gaming, but not for CUDA (it would be too slow anyways); so communication is really all done by PCI Express.
Hope this helps!
It does, thanks! We are still deciding between a single CPU vs dual-CPU (for other computing purposes). Could you comment on the following two motherboards being suitable for our 4 titans:
http://www.asus.com/us/Commercial_Servers_Workstations/X99E_WS/overview/
http://www.asus.com/Commercial_Servers_Workstations/Z10PED8_WS/overview/
In particular the Z10PED8 states it supports “4 x PCIe 3.0/2.0 x16 (dual x16 or quad x8)”, from which I understand it does NOT support quad x16. Would the X99 be the best solution then?
It is quite difficult to say which one is better, because I do not know the PCIe switch layout of the dual CPU motherboard. The most common PCIe switch layout is explained in this article, and if the dual CPU motherboard that you linked behaves in a similar way, then for deep learning 2 CPUs will definitely be slower than 1 CPU if you want to use parallel algorithms across all 4 GPUs; in that case the 1 CPU board will be better. However, this might be quite different for computing purposes other than deep learning, and a 2 CPU board might be better for those tasks.
Hi Tim,
Thank you for all your advice on how to build a machine for DL!
You don’t talk about the possibility of using a GPU embedded in the motherboard (or a “small” second GPU) so as to dedicate the “big” GPU to computation. Could that affect the performance in any way?
Also we want to build a computer to reproduce and improve -by making a more complex model- the work of DeepMind about their generalist AI.
We were thinking about getting one Titan X with 32G of RAM.
Would you have any specific recommendation concerning the motherboard and CPU?
Thank you very much
There are some GPUs which are integrated (embedded) into regular CPUs and you can run your monitors on these. The effect of this is some saved GPU memory (about a hundred MB for each monitor) but very little saved computational resources (less than 1% for 3 monitors). So if you are really short on memory (say you have a GPU with 2 or 3GB and 3 monitors) then this might make good sense. Otherwise, it is not very important, and integrated graphics should not be a deciding factor when you buy a CPU.
As I said in the article, you have a wide variety of options for the CPU and motherboard, especially if you stick with one GPU. In this case you can really go for very cheap components and it will not hurt your performance much. So I would go for the cheapest CPU and motherboard with a reasonably good rating on pcpartpicker.com if I were you.
Thank you very much!
Hi Tim,
First can I say thanks very much for writing this article – it has been very informative.
I’m a first year PhD student. My research is concerned with video classification and I’m looking into using convolutional nets for this purpose.
My current system has a GT 620 which takes about 4 hours to run a LeNet-5 based network built using Theano on MNIST. So I’m looking to upgrade and I have about £1000 to do it with.
I’ve allocated about £500 for the GPU but I’m struggling to decide what to get. I’ve discounted the GTX 970 due to the memory problems. I was thinking either a GTX 780 (6GB Asus version), a GTX 980 or two GTX 960s. What is your opinion on this? I know I can’t use multiple GPUs with Theano, but I could run two different nets at the same time on the 960s; however, would it be quicker just to run each net consecutively on the 980 since it’s faster? There’s also the 780 which, although slower than the 980, has more RAM, which would be beneficial for convolutional nets. I looked into buying second hand as you suggested; however, I’m buying through my university so that isn’t an option.
Thanks for your help and for the great article once again.
Cheers,
Richard
That is really a tricky issue, Richard. If you use convolution over the spatial dimensions of an image as well as the time dimension, you will have 5-dimensional tensors (batch size, rows, columns, maps, time) and such tensors will use a lot of memory. So you really want a card with a lot of memory. If you use the Nervana Systems 16-bit kernels you would be able to reduce memory consumption by half; these kernels are also nearly twice as fast (for dense connections they are more than twice as fast). To use the Nervana Systems kernels, you will need a Maxwell GPU (GTX Titan X, GTX 960, GTX 970, GTX 980). So if you use this library a GTX 980 will have “virtually” 8GB of memory, while the GTX 780 has 6GB. The GTX 980 is also much faster than the GTX 780, which further adds to the GTX 980 option. However, the Nervana Systems kernels still lack some support for natural language processing, and overall you will have far more mature software if you use Torch and a GTX 780. If you think about adding your own CUDA kernels, the Nervana Systems + GTX 980 option may be less suitable, because you will probably need to handle the custom compiler and program 16-bit floating point kernels (I have not looked at this, but I believe there will be things which make it more complicated than regular CUDA programming).
I think both the GTX 780 and the GTX 980 are good options. The final choice is up to you!
Hope this helps!
Cheers,
Tim
Thanks for the detailed response Tim,
Think I’ll go with the 780 for now due to the extra physical memory. Quick follow-up question: if I have the money for an additional card in the future, would I need to buy the same model? Could I, for example, have both a GTX 780 and a GTX 980 running in the same machine so that I can have two different models running on each card simultaneously? Would there be any issues with drivers etc.? Going to order the parts for my new system tomorrow; will post some benchmarks soon.
Cheers,
Richard
GPUs can only communicate directly if they are based on the same chip (but brands may differ). So for parallelism you would need to get another GTX 780; otherwise a GTX 980 is fine for everything else. Also remember that new Pascal GPUs will hit around Q3 2016 and those will be significantly faster than any Maxwell GPU (3D memory) — so waiting might be an option as well.
FYI on the Pascal chip from NVIDIA. The speedup over Titan is “up to 5x.” Of this, a 2x speedup will come from the option of switching to 16-bit floating point in Pascal.
The rest of the “up to 10x speedup” comes from the 2x speedup you get from NVLink. Here the comparison is two Pascals versus two Titans. I don’t know what the speedup would be if the Pascals used the same PCIe interlink as the Titans, or if they could even use the PCIe interlink. Hopefully they can, so that a new motherboard would not be necessary.
That second 10x speedup claim with NVLink is a bit strange because it is not clear how it is being made.
That sounds interesting. Would you mind sharing more details about your G3258-based system?
I do not have a Haswell G3258 and I would not recommend one, as it only runs 16 PCIe 3.0 lanes instead of the typical 40. So if you are looking for a CPU I would not pick Haswell — too new and thus too expensive, and many Haswells do not have full 40 PCIe lanes.
Sorry Tim, my comment was meant to be in response to the comment #128 by user “lU” from March 9, 2015 at 10:59 PM. I wonder why it didn’t appear under that one despite having double-checked before posting. I guess it’s the fault of my mobile browser.
First of all, thank you for a series of very informative posts, they are all much appreciated.
I was planning to go for a single GPU system (GTX 980 or the upcoming 980 Ti) to get started with deep learning, and I had the impression that at $72, this is the most affordable CPU out there.
You’re welcome! I was looking for other options, but to my surprise there were not any in that price range. If you are using only a single GPU and you are looking for the cheapest option, this is indeed the best choice.
What are your thoughts on the GTX 980 Ti vs. the Titan X? I guess with “980” in your article you referred to the 4 GB models. The 980 Ti has the same Memory Bandwidth as the Titan X, 2GB more memory than a 980 (which should make it better for big convnets), only a few CUDA cores less. And the price difference is 549 USD for a 980 Ti vs 999 USD for the Titan X.
The GTX 980 Ti is a great card and might be the most cost effective card for convolutional nets right now. The 6GB RAM on the card should be good enough for most convolutional architectures. If you will be working on video classification or want to use memory-expensive data sets I would still recommend a Titan X over a 980 Ti.
Hey Tim! Thanks for these posts, they’re highly, highly appreciated! I’m just starting to get my feet wet in deep learning – is there any way to hook up my Laptop to a GPU (maybe even an external one?) without having to build a PC from scratch so I could start GPGPU programming on small datasets with less of an investment? Does the answer depend on my motherboard?
In that case it will be best to use AWS GPU spot instances, which are cheap and fast. External GPUs are available, but they are not an option because the data transfer, CPU -> USB-like interface -> GPU, is too slow for deep learning. Once you have gained some experience with AWS I would then buy a dedicated deep learning PC.
Thanks, that sounds like a good idea then!
Hi Tim!
Thanks for your helpful and detailed write-up.
It seems from this blog post (http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/) that for concurrent kernel execution and data transfer the memory must be pinned.
You wrote “…one might be able to prevent that when one uses pinned memory, but as you shall see later, it does not matter anyway…” and AFAIU you don’t use pinned memory in the async batch allocation process (`clusternet` project).
Also, pinned memory is mentioned in the documentation (http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#asynchronous-transfers-and-overlapping-transfers-with-computation), but at the same time this (http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g85073372f776b4c4d5f89f7124b7bf79) document says “…the copy may overlap with operations in other streams…” and no mention about pinned memory.
These contradictory facts are a bit confusing to me. So my question is:
– Do you have the code that can confirm overlapping between transfer process of pageable host memory to the device memory and kernel execution?
– And what is actually going on with cudaMemcpyAsync?
What you write is all true, but you have to look at it in two different ways: (1) CPU -> GPU, and (2) GPU -> GPU.
For CPU -> GPU you will need pinned memory to do asynchronous copies; however, for GPU -> GPU the copy will automatically be asynchronous in most use cases — no pinned GPU memory needed (cudaMemcpy and cudaMemcpyAsync are almost always the same for GPU -> GPU transfers).
It turns out that I do use pinned memory in my clusterNet project, but it is a bit hidden in the source code: I use it only for batch buffers in my BatchAllocator class, which has an embarrassingly poor design. There I transfer ordinary (pageable) CPU memory to a pinned buffer (while the GPU is busy) and then, in a second step, transfer it asynchronously to the GPU, so that the batch is ready when the GPU needs it.
You can also allocate the whole data set as pinned memory, but this might cause some problems, because once pinned, the OS cannot “optimize” the locked-in memory anymore, which may lead to performance problems if the pinned chunk is too large.
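To make the staging pattern concrete, here is a minimal sketch of the general idea in plain CUDA C++ (this is not the actual clusterNet/BatchAllocator code; the buffer names and the batch size are made up for illustration):

```cpp
// Minimal sketch: pageable host data -> pinned staging buffer -> async copy to GPU.
#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

int main() {
    const size_t batch_bytes = 128 * 784 * sizeof(float);   // e.g. 128 samples of size 784

    float *pageable = (float*)malloc(batch_bytes);           // ordinary (pageable) host memory
    float *staging  = nullptr;                               // pinned staging buffer
    float *d_batch  = nullptr;                               // device buffer

    cudaMallocHost((void**)&staging, batch_bytes);           // pinned (page-locked) allocation
    cudaMalloc((void**)&d_batch, batch_bytes);

    cudaStream_t copy_stream;
    cudaStreamCreate(&copy_stream);

    // 1) The CPU copies the next batch into the pinned buffer (the GPU can be busy meanwhile).
    memcpy(staging, pageable, batch_bytes);

    // 2) Asynchronous transfer to the GPU; this returns immediately because the source
    //    is pinned, so the CPU is free to go on preparing the next batch.
    cudaMemcpyAsync(d_batch, staging, batch_bytes, cudaMemcpyHostToDevice, copy_stream);

    // 3) Before a kernel consumes d_batch, wait for the copy to finish.
    cudaStreamSynchronize(copy_stream);

    cudaFree(d_batch);
    cudaFreeHost(staging);
    free(pageable);
    cudaStreamDestroy(copy_stream);
    return 0;
}
```

In a real training loop you would keep two such staging/device buffer pairs and alternate between them, so that the copy for batch N+1 runs while the GPU still computes on batch N.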
Thank you for the reply.
Do you know what is the reason for the inability to have overlapping pageable host memory transfer and kernel execution?
It all has to do with having a valid pointer to the data. If your memory is not pinned, then the OS can move the memory around freely to make optimizations, so you are not guaranteed a stable pointer to the CPU memory, and thus such transfers are not allowed by the NVIDIA software because they could easily run into undefined behaviour. With pinned memory, the memory can no longer move, so a pointer to it stays the same at all times and a reliable transfer can be ensured.
This is different for GPUs, because GPU pointers are designed to stay valid at all times as long as the data stays in GPU memory, so these problems do not exist for GPU -> GPU transfers.
Thanks for the wonderful explanation. But I still have a question. Your previous reply explains why a data transfer with pageable memory can’t be asynchronous with respect to the host thread, but I still do not understand why a device can’t execute a kernel while copying data from the host. What is the reason for that?
Kernels can execute concurrently; the kernel just needs to work on a different stream of data. In general, each GPU can have one host-to-GPU and one GPU-to-GPU transfer active, and execute a kernel concurrently on unrelated data in another stream (by default all operations use the default stream and are thus not concurrent).
But you are right that you cannot execute a kernel and a data transfer on the same data in the same stream. I assume the hardware cannot pause and resume a kernel whenever it catches up with the end of the data that has been transferred so far (the kernel would need to wait, then compute, then wait, then compute, then wait… — this would not deliver good performance!). So it will be because of this that you cannot run a kernel on partially transferred data.
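As a minimal sketch of this kind of concurrency (illustrative names, not taken from any particular library): the copy below runs in one stream while a kernel works on unrelated data in another stream; had both been issued in the default stream, they would have run one after the other.

```cpp
// Minimal sketch: overlap a host->device copy with a kernel on unrelated data.
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;                       // works only on data already on the GPU
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_pinned, *d_old, *d_new;
    cudaMallocHost((void**)&h_pinned, bytes);   // pinned host memory (needed for a truly async copy)
    cudaMalloc((void**)&d_old, bytes);          // data already resident on the GPU
    cudaMalloc((void**)&d_new, bytes);          // destination of the new data

    cudaStream_t s_copy, s_compute;
    cudaStreamCreate(&s_copy);
    cudaStreamCreate(&s_compute);

    // These two calls can overlap: the copy engine moves the new data
    // while the SMs run the kernel on the old data.
    cudaMemcpyAsync(d_new, h_pinned, bytes, cudaMemcpyHostToDevice, s_copy);
    scale<<<(n + 255) / 256, 256, 0, s_compute>>>(d_old, n, 2.0f);

    cudaDeviceSynchronize();                    // wait for both streams to finish

    cudaFree(d_old); cudaFree(d_new); cudaFreeHost(h_pinned);
    cudaStreamDestroy(s_copy); cudaStreamDestroy(s_compute);
    return 0;
}
```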
Sorry that my question was confusing.
I wrote some simple code which runs axpy cuBLAS kernels and memcpy. As you can see from the profiler, in the case of pinned memory the kernels that were launched after cudaMemcpyAsync run in parallel with the transfer.
However, in the case of pageable memory, cudaMemcpyAsync blocks the host, and I can’t launch the next kernel.
In the chapter `Direct memory access (DMA)` you say “…on the third step the reserved buffer is transferred to your GPU RAM without any help of the CPU…”, so why does cudaMemcpyAsync block the host until the end of the copy process? What is the reason for that?
The most low-level reason I can think of is, as I said above, that pageable memory is inherently unreliable and may be swapped or pushed around at will. If you start a transfer and want to make sure that everything works, it is best to wait until the data is fully received. I do not know the low-level details of how the OS and its drivers and routines (like DMA) interact with the GPU. If you want to know these details, I think it would be best to consult with people from NVIDIA directly; I am sure they can give you a technically accurate answer. You might also want to try the developer forums.
Do you think that if you have too many monitors, they will occupy too many of your GPU’s resources? If yes, how can this issue be solved?
I have three monitors with 1920×1080 resolution and the monitors draw about 400 MB. I never had any issues with this, but I also had 6GB cards and I did not train models that maxed out my GPU RAM. If you have a GPU with less memory (GTX 980 or GTX 970) then there might be some problems for convolutional nets. The best way to circumvent this problem is to buy a really cheap GPU for the monitors (a GT 210 costs about $30 and can power two (three?) monitors), so that your main deep learning GPU is not attached to any monitor.
Tim, you have a wonderful blog and I am very impressed with the knowledge as well as the effort that you are putting into it.
I run a Silicon Valley startup that works in the space of wearables bio-sensing; we developed unique non-invasive sensors that can measure vitals and psychological and physiological effects. Most of our signals are multivariate time series; we typically process (1×3000) per sensor per reading, and we can typically use up to 5 sensors.
We are currently expanding our ML algorithms to add CNN capabilities, and I wonder what you recommend in terms of a GPU.
Also, I would highly appreciate it if you could email me to further discuss a potentially mutually beneficial collaboration.
Regards,
Sameh
Hi Sameh! If you have multivariate time series, a common CNN approach is to use a sliding window over your data of X time steps. Your convolutional net would then use temporal instead of spatio-temporal convolution, which uses much less memory. As such, 6GB of memory should probably be sufficient for such data and I would recommend a GTX 980 Ti or a GTX Titan. If you need to run your algorithms on very large sliding windows (an important signal happened 120 time steps ago, to which the algorithm should be sensitive), a recurrent neural network would be best, for which 6GB of memory would also be sufficient. If you want to use CNNs with such large windows it might be better to get a GTX Titan X with 12GB of memory.
Regards,
Tim
Tim,
I am new to deep NNs. I discovered their tremendous progress after seeing the excellent 2015 GTC NVIDIA talk. Deep NNs will be very useful for my PhD, which is about electrical brain signal classification (Brain Computer Interfaces).
What a joy to have found your blog! I just wish you wrote more.
All your posts are full of interesting ideas. I have checked the comments on the posts, which are no less interesting than the posts themselves and full of important hints too.
I read a lot, but did not find most of your interesting hints on hardware elsewhere. Your posts were just brilliant. I believe your posts filled a gap in the web, especially on the performance and hardware side of deep NNs.
I think on the hardware side, after reading your posts I have enough knowledge to build a good system.
On the software side, I found a lot of resources. However, I am still a bit confused. Perhaps because they weren’t your posts 😉 . Why do you only write on hardware? You can write very well, and we would love to hear from your experience on software too.
Where should I begin?
I’m very fond of Matlab and haven’t programmed much in other languages. And I don’t know anything about Python, which seems very important to learn for machine learning. I don’t mind learning Python if you advise me to do so. But if it is not necessary, then maybe I can spare my time for other deep NN stuff, which is overwhelming already. My excitement crippled me. I have opened ~600 tabs and want to see them all.
If you were in my shoes, which platform would you begin learning with? Caffe, Torch or Theano? Why?
And please, tell me too about your personal preference. I learned from your posts that you are making your own programs. But if you were picking one of these for yourself, which would it be? And if you were like me, with no Python experience, which would you pick in that case?
I am very interested to hear your opinion. I am not in a hurry. When you feel like writing, please answer me with some details.
I thank you sincerely for all the posts and comment replies in your blog and eager to see more posts from you Tim!
Thank you!
Thank you for all this praise — it is encouraging! I wrote about hardware mainly because I myself focused on the acceleration of deep learning, and understanding the hardware was key in this area to achieve good results. Because I could not find the knowledge that I acquired elsewhere on a single website, I decided to write a few blog posts about it. I plan to write more about other deep learning topics in the future.
In my next posts I will compare deep learning to the human brain: I think this topic is very important because the relationship between deep learning and the brain is in general poorly understood.
I also wanted to write a blog post about software, but I have not had the time yet to do that — I will probably do so this month or the next.
Regarding your questions, I would really recommend Torch7, as it is the deep learning library with the most features and it is steadily extended by Facebook and DeepMind with new deep learning models from their research labs. However, as you posted above, it is better for you to work on Windows, and Torch7 does not work well on Windows. Theano is the best option here I guess, but Minerva also seems to be okay.
Caffe is a good library when you do not want to fiddle around too much within a certain programming language and just want to train deep learning models; the downside is that it is difficult to make changes to the code and the training procedure/algorithm, and few models are supported.
For brain signals per se, I think Python offers a lot of packages which might be helpful for your research.
However, if you just want to get started quickly with the language you know, Matlab, then you can also use the neural network bindings from the Oxford research group, with which you can use your GPU to train neural networks within Matlab.
Hope this helps, good luck!
Hi Tim,
Thanks for your support of the Deep Learning group.
I have a workstation DELL T7610 http://www.dell.com/sg/business/p/precision-t7610-workstation/pd.
I want to plug in 2 Titan X cards, from NVIDIA and ASUS. Everything seems okay; I just wonder about the PSU, cooling, and the dimensions of the GPUs.
I will check the cooling and dimensions later. My main concern is about power.
I looked at the documents http://www.geforce.com/hardware/desktop-gpus/geforce-gtx-titan-x/specifications and https://www.asus.com/Graphics-Cards/GTXTITANX12GD5/specifications/.
Both of them require power up to 300W.
However, in the specs of the workstation, they say something about graphics cards:
Support for up to three PCI Express® x16 Gen 2 or Gen 3 cards up to 675W (total for graphics (some restrictions apply))
GPU: One or two NVIDIA Tesla® K20C GPGPU – Supports Nvidia Maximus™ technology.
So the total power seems okay, right?
More evidence:
The power supply of the workstation would be:
Power Supply: 1300W (externally accessible, toolless, 80 Plus® Gold Certified, 90% efficient)
CPU (230W) + 2 GPUs (2 × 300W) + 300W = 1130W.
It seems okay for the power, right?
Hope to have your opinions.
Thank you for your sharing.
Everything looks fine. I ran 3 GTX Titan with a 1400 watt PSU and 4 GTX Titan with 1600 watt, so you should definitely be fine with 1300 watt and 2 GPUs. A GTX Titan also uses more power than a GTX Titan X. Your calculation looks good and there might even be space for a third GPU.
P.S. The comments are locked in place to await approval if someone new posts on this website. This is so to prevent spam.
Will ECC RAM make convolutional NNs or deep learning more efficient or better? In other words, if the same money can buy me one PC with ECC RAM vs. TWO PCs without ECC RAM, which should I pick for deep learning?
I think ECC memory only applies to 64-bit operations and thus would not be relevant to deep learning, but I might be wrong.
ECC corrects a bit that is flipped the wrong way due to physical inconsistencies at the hardware level of the system. Deep learning was shown to be quite robust to inaccuracies; for example, you can train a neural network with 8 bits (if you do it carefully and in the right way), and training a neural network with 16 bits works flawlessly. Note that training on 8 bits, for example, will decrease the accuracy for all data, while ECC is relevant only for small parts of the data. However, a flipped bit might be quite severe, while a conversion from 32 to 8 bits might still be quite close to the real value. But overall I think an error in a single bit should not be so detrimental to performance, because the other values might counterbalance this error, or in the end the softmax will buffer it (an extremely large error value sent to half the connections might spread to the whole network, but in the end, for that sample, the softmax probability will just be 1/classes for each class).
Remember that there are always a lot of samples in a batch, and that the error gradients in this batch are averaged. Thus even large errors will dissipate quickly, not harming performance.
Hi Tim,
1) Great post.
2) Do you know how motherboards with dedicated PCIe lane controllers shuffle data between GPUs with deep learning software? For example, the PLX PEX 8747 purports to control 48 PCIe lanes beyond the 40 lanes a top-shelf CPU controls, e.g. allowing five x16 connections, but it’s not clear to me if deep learning software makes use of such dedicated PCIe lane controllers.
I ask since going beyond three x16 connections with CPU control of PCIe lanes alone requires dual CPUs, but such boards along with suitable CPUs can in sum be thousands of dollars more expensive than a single-CPU motherboard that has a PLX PEX 8747 chip. If the latter performs just as well for deep learning software, one might as well save the money!
Thanks!
-Charles
That is very difficult to say. I think the PLX PEX 8747 chip will be handled by the operating system after you install a driver, so that deep learning software would use it automatically in the background. However, it is unclear to me whether you really can operate three GPUs at 16x/16x/16x when you use this chip, or whether it will support peer-to-peer GPU memory transfers. I think you will need to get in touch with the manufacturer for that.
Hi Tim, makes sense. Thanks for the reply.
I’ll need to dig more. I’ve seen various GPU-to-GPU benchmarks for server-grade motherboards (e.g. in HPC systems), including a raw ~ 7 GB/s using a PLX PEX chip (lower than host-to-GPU), but I’ve had difficulty finding benchmarks for single-CPU boards, let alone for more than three x16 GPU connections.
If you come across a success story of a consumer-grade single-CPU system with exceptional transfer speed (better than 40 PCI-E 3.0 lanes worth in sum) between GPUs when running common deep learning software/libraries, or even a system with such benchmarks for raw CUDA functions, please update.
In the meantime, I look forward to your other posts!
Best,
Charles
AMD’s Naples CPU is expected to provide 128 lanes: 64 lanes for 4 PCIe expansion cards at x16 and the remaining lanes for the CPU-to-CPU interconnect (called Infinity Fabric).
Source:
https://arstechnica.co.uk/information-technology/2017/03/amd-naples-zen-server-chip-details/
In another article, it is implied that with 1xCPU systems, 128 lanes will be available for I/O, presumably allowing for full x16 lanes on up to 8 GPUs, or for use with NVLink bridges.
Source:
http://www.anandtech.com/show/11183/amd-prepares-32-core-naples-cpus-for-1p-and-2p-servers-coming-in-q2
Very Useful information indeed, Tim.
I have a newbie question: If the motherboard has integrated graphics facility, and if the GPU is to be dedicated to just deep learning, should the display monitor be connected directly to the motherboard rather than the GPU?
I have just bought a machine with GeForce Titan X card and they just sent me a e-mail saying:
“You have ordered a graphics card with your computer and your motherboard comes supplied with integrated graphics. When connecting your monitor it is important that you connect your monitor cable to the output on the graphics card and NOT the output on the motherboard, because by doing so your monitor will not display anything on the screen.”
Intuitively, it seems that off-loading the display duties to the motherboard will free the GPU to do more important things. Is this correct? If so, do you think this can be done simply? I would ask the supplier, but they sounded lost when I started talking about deep learning on graphics cards.
Regards
Xardoz
Hi Xardoz! You will be fine when you connect your monitor to your GPU, especially if you’re using a GTX Titan X. The only significant downside of this is some additional memory consumption, which can be a couple of hundred MB. I have 3 monitors connected to my GPU(s) and it never bothered me doing deep learning. Only if you train very large convolutional nets that are on the edge of the 12GB limit would I think about using the integrated graphics.
Thanks Tim.
It seems that my motherboard’s graphics capability (Asus Z97-P with an Intel i7-4790K) is not available if a graphics card is installed.
And yes, I do need more than 12GB for training a massive NN! So I decided to buy a small graphics card just to run the display, as suggested in one of your comments above. Seems to work fine.
Regards
Hi Tim, very nice sharing. I would just like to comment on the ‘silly’ parts (smile): the monitors. Since I only have one monitor, I just use NoMachine and put the screen in one of my virtual workspaces in Ubuntu to switch between the current machine and our deep learning servers. Surprisingly this is more convenient and efficient, both for the electricity bill and for our neck movement. I just hope this helps, especially those who only have a single monitor. Cheers.
Thanks for sharing your working procedure with one monitor. Because I got a second monitor early, I kind of never optimized the workflow on a single monitor. I guess when you do it well, as you do, one monitor is not so bad overall — and it is also much cheaper!
So, I did some research on deep learning hardware, and I assume the most appropriate part list is:
Motherboard: X10DRG-Q – This is a dual-socket board which allows you to double the PCIe lanes of the CPU. It has 4 fully functional x16 PCIe 3.0 slots and an extra x4 PCIe 2.0 slot for a Mellanox card.
CPU: 2x E5-2623
Network card: Mellanox ConnectX-3 EN Network Adapter MCX313A-BCBT
Star of the show: 4x Titan X
Assuming the other parts are $1000, the total cost would be $7,585, half the price of the NVIDIA DevBox. My god, NVIDIA.
This sounds like a very good system. I was not aware of the X10DRG-Q motherboard; usually such mainboards are not available for private customers — this is a great board!
I do not know the exact topology of the system compared to the NVIDIA DevBox, but if you have two CPUs this means you will have an additional switch between the two PCIe networks, and this will be a bottleneck where you have to transfer GPU memory through CPU buffers. This makes algorithms complicated and prone to human error, because you need to be careful how you pass data around in your system; that is, you need to take into account the whole PCIe topology (on which network and switch the InfiniBand card sits, on which network each GPU sits, etc.). cuda-convnet2 has some 8-GPU code for a similar topology, but I do not think it will work out of the box.
If you can live with more complicated algorithms, then this will be a fine system for a GPU cluster.
I got it, so I will stick to the old plan then. Thank you anyway.
Hi Tim
Fortunately, Supermicro provided me with the X10DRG-Q mobo diagram, and it would also be a general diagram for other socket 2011 dual-socket mobos which have 4 or more PCIe x16 slots. The 2 CPUs are connected by 2 QPI links (Intel QuickPath Interconnect). If CPU1 has 40 lanes, then 32 lanes go to 2 PCIe x16 slots, 4x to the 10-Gigabit LAN, and 4x to an x4 PCIe slot (x8 slot shape, which will be covered if you install a 3rd graphics card). The 2nd CPU also provides 32 lanes for PCI Express, and then 8x goes to the x8 slot at the top (nearest the CPU socket). Pretty complicated.
The point of building a perfect 4×16 PCIe 3.0 system was that I thought the performance would be halved if the bandwidth goes from 16x down to 8x. Do you have any information on how much the performance differs, say for a single Titan X, between 16x 3.0 and 16x 2.0?
Yes, that sounds complicated indeed! A 16x 2.0 slot will be as fast as an 8x 3.0 slot, so the bandwidth is also halved by stepping down to 2.0. I do not think there exists a single solution which is easy and at the same time cheap. In the end I think the training time will not be that much slower if you run 4 GPUs on 8x 3.0, and with that setup you would not run into any programming problems with parallelism and you will be able to use standard software like Torch7 with integrated parallelism — so I would just go for an 8x 3.0 setup.
If you want a less complicated system that is still faster, you can think about getting a cheap InfiniBand FDR card on eBay. That way you would buy 6 cheap GPUs and hook them all up via InfiniBand at 8x 3.0. But probably this will be a bit slower than straight 4x GTX Titan X at 8x 3.0 on a single board.
I’m so sorry — the X3 version of the Mellanox card does not support RDMA, but the X4 does.
Hi Tim,
First of all, excellent blog! I’m putting together a gpu workstation for my research activities and have learned a lot from the information you’ve provided so …. thanks!! 🙂
I have a pretty basic question. So basic I almost feel stupid asking it but here goes …
Given your deep learning setup which has 3x GeForce Titan X for computational tasks, what are your monitors plugged in to?
I would like a very similar setup to yours (except I’ll have two 29″ monitors) and I was wondering if it’s possible to plug these into the Titan cards and have them render the display AND run calculations.
Or is it better to just have another, much cheaper, graphics cards which is just for display purposes?
I have my monitors plugged into a single GTX Titan X and I experience no side effects from that other than the couple of hundred MB of memory that is needed for the monitors; the performance for CUDA compute should be almost the same (probably something like 99.5%). So no worries here, just plug them in where it works for you (on Windows, one monitor would also be an option I think).
Hello Tim,
First of all thanks for always answering my questions and sorry for coming back with more 🙂
Do you think a 980 (4GB) is enough for training current neural nets (AlexNet, OverFeat, VGG), or would it be wise to go for a 980 Ti?
PS: I am a PhD student, time for me is cheaper than euros 🙂
Thanks again.
4 GB of memory can indeed be quite short sometimes. If time is cheaper than money, go for a GTX 980Ti, or even better a GTX Titan X.
Hi, I intend to plug 2 Titan X GPUs into my workstation. The spec of my workstation says it is possible to have up to 2 NVIDIA K20 GPUs. In fact, the K20 and the Titan X are the same size. However, now that I have the first Titan X, I measured that if I plug the second one in, there will only be a tiny space between the 2 GPUs. I wonder if this is safe for the cooling of the GPU system.
Hope to have your opinion.
Thanks
A very tiny space between GPUs is typical for non-Tesla cards and your cards should be safe. The only problem is that your GPUs might run slower because they reach their 80 degrees temperature limit earlier. If you run a Unix system, flashing a custom BIOS to your Titans will modify the fan regulation so that your GPUs should stay cool (< 80 degrees C) at all times. However, this may increase the noise and heat inside the room where your system is located. Flashing a BIOS for better fan regulation will first and foremost increase the lifetime of your GPUs, but overall everything should be fine and safe without any modifications, even if you operate your cards at maximum temperature for some days without pause (I personally used the standard settings for a few years and all my GPUs are still running well).
Hi Tim,
Thanks for your responses. I read your posts and I remember an image of a piece of software in Ubuntu that visualizes the state of the GPU, something similar to the Task Manager for CPUs. If you have information, please let me know.
Hi Tim,
I just found a way to increase GPU fan in Ubuntu using Nvidia X server settings. The details are in http://askubuntu.com/questions/42494/how-can-i-change-the-nvidia-gpu-fan-speed
Indeed, this will work very well if you have only one GPU. I did not know that there was an application which automatically prepares the xorg config to include the cooling settings — this is very helpful, thank you! I will include that in an update in the future.
I just found a way to increase fan speed of multiple GPUs without flashing. Here is my documentation.
http://antechcvml.blogspot.sg/2015/12/how-to-control-fan-speed-of-multiple.html
This looks great! Thank you!
Hi Tim,
I’m a Caffe user, and since Caffe has recently added support for multiple GPUs, I have been wondering if I should go with a Titan X or with 2 GTX 980s. Which of these 2 configurations would you choose? I’m more inclined towards the 2 GTX 980s, but maybe there are some downsides to this configuration that I haven’t thought about.
Thanks!
This is relevant. I do not have experience with Caffe parallelism, so I cannot really say how good it is. So 2 GPUs might be a little bit better than I said in the quora answer linked above.
Hi Tim,
Thanks a lot for your great hardware guide!
I’m planning to build a 3x Titan X GPU setup, which will be running more or less constantly: would you say that water cooling will make a big impact on performance (by keeping the temperatures always below 80 degrees)?
As the machine will be installed remotely, where I don’t have easy access to it, I’m a bit nervous about installing a water cooling system in such a setup, with the risk of coolant leakage, so the “risk” has to be worth the performance gain 🙂
Do you have any experience with water-cooled systems, and would you say it would be a useful addition?
Also, would you advise a nice tightly-fit chassis, or a bigger one which allows better airflow?
Finally (so many questions :P), do you think 1500 watts with 92-94% efficiency at 100% load should suffice in case I use 4 Titan X GPUs, or would it be better to go for a 1600W PSU?
If you operate the computer remotely, another option is to flash the BIOS of the GPUs and crank up the fans to max speed. This will produce a lot of noise and heat, but your GPUs should run slightly below 80 degrees, or at 80 degrees with little performance loss.
Water cooling is of course much superior, but if you have little experience with it, it might be better to just go with an air-cooled setup. I have heard that if installed correctly, water cooling is very reliable, so maybe this would be an option if somebody else, who is familiar with water cooling, helps you set it up.
In my experience, the chassis does not make such a big difference. It is all about the GPU fans, and getting the heat out quickly (which is mostly towards the back and not through the case). I installed extra fans for better airflow within the case, but this only makes a difference of 1-2 degrees. What might help more are extra backplates and small attachable cooling pads for your memory (both about 2-5 degrees).
I used a 1600W PSU with 4 GTX Titans, which need just as much power as a GTX Titan X, and it worked fine. I guess 1500W would also work well, and 92-94% efficiency is really good. I would try the 1500W one and if it does not work just send it back.
Thanks for the detailed response. I’ve decided to go for:
– Chassis: Corsair Carbide Air 540
– Motherboard: ASUS X99-E WS
– Cpu: Intel(Haswell-e) Core i7 5930K
– Ram: 64GB DDR4 Kingston 2133Mhz
– Gpu: 3 x NVIDIA GTX TITAN-X 12GB
– HD1: 2 X 500GB SSD Samsung EVO
– HD2: 3 X 3TB WD Red in RAID 5
– PSU: Corsair AX1500i (1500Watt)
With a custom-built water cooling system for both the CPU and the 3 Titan X’s, which I hope will let me crank up these babies while keeping the temperature below 80 degrees at all times.
The machine is partly (at least the chassis is) inspired by NVIDIA’s recently released DevBox for Deep Learning (https://developer.nvidia.com/devbox), but for almost half the price. Will post some benchmarks with the newer cuDNN v3 once it’s built and all set up.
How did your setup turn out? I am also looking to either build a box or find something ready-made (if it is appropriate and fits the bill). I was thinking of scaling down the NVIDIA DevBox as well. I also saw these http://exxactcorp.com/index.php/solution/solu_detail/233 which are similar. Very expensive.
Why is there no mention of Main Gear https://www.maingear.com/custom/desktops/force/index.php anywhere? Are they no good? The price seems too good to be true. I have heard that they break down, but I have also heard that the folks at Main Gear are very responsive and helpful.
Thanks for any insight and thanks Tim for the great blog posts!
Hi Tim!
We’ve already asked you for some advice, and it was helpful… We put together a dev box in the meanwhile, with 4 Titans inside, and it works perfectly.
Now we are considering production servers for image tasks. One of them would be classification. Considering the differences between training and runtime (runtime handles a single image, forward prop only), we were wondering if it would be more cost effective to run multiple weaker GPUs, as opposed to fewer stronger ones… We are reasoning that a request queue consisting of single-image tasks could be processed faster on two separate cards, by two separate processes, than on a single card that is twice as fast. What are your thoughts on this?
We’ve run very crude experiments, comparing classification speed of a single image on a Titan machine vs. 960M-equipped laptops. The results were more or less as we expected: Titans are faster, but only about 2x, whereas Titans are 4x more expensive than a GTX 960 (which has significantly more GFLOPS than the 960M). In absolute terms, classification speed on a weaker card is acceptable; we’re wondering about behavior under heavy load.
F
Hi Florijan!
I think in the end this is a numbers game. Try to overflow a GTX 960M and a Titan with images and see how fast they go and compare that with how fast you need to be. Additionally, it might make sense to run the runtime application on CPUs (might be cheaper and more scalable to run them on AWS or something) and only run the training on GPUs. I think a smart choice will take this into account, and how scalable and usable the solution is. Some AWS CPU spot instances might be a good solution until you see where your project is headed to (that is if a CPU is fast enough for your application).
Tim,
Thanks for your reply. You’re right, it definitely is a numbers game; I guess we will simply need to stress-test.
We already tried to run our classifier on the CPU, but classification time was an order of magnitude slower than on the 960M, so that doesn’t seem a good option, especially considering the price of a GTX 960 card.
We’ll do a few more tests at some point. If we find out anything interesting, I’ll post back here…
F
Hi Tim,
I have a minor question related to the 6-pin and 8-pin power connectors. It relates to your sentence “One important part to be aware of is if the PCIe connectors of your PSU are able to support a 8pin+6pin connector with one cable”.
My workstation has one cable that goes from an 8-pin connector to TWO 6-pin connectors. Is it possible to plug these two 6-pin connectors in to power up a Titan X, which requires 6-pin and 8-pin power connectors? I think I will try it, because I want to plug in 2 Titan X GPUs and only this way can my workstation support up to two GPUs.
Thank you so much.
@An
I think this will depend somewhat on how the PSU is designed, but I think you should be able to power two GTX Titan X with one double 6-pin cable, because the design makes it seem that it was intended for just that. Why would they put two 6-pin connectors on one cable if you cannot use them? I think you can find better information if you look up your PSU and see if there is documentation, a specification or something like that.
Hi Tim,
Firstly, thanks for this article; it’s extremely informative (in fact your entire blog makes fascinating reading for me, since I’m very new to neural networks in general).
I want to get a more powerful GPU to replace my old GTX 560 Ti (a great little card, but 1GB of memory is really limiting and I presume it’s pretty slow these days too). Sadly I cannot really afford the GTX Titan X (as much as I’d like to, 1300 CAD is too damn high). The 980 Ti is also a bit on the high end, so I’m looking at the 980, since it’s about 200 CAD cheaper. My question is: how much performance am I gaining going from my old 560 Ti to a 980/980 Ti/Titan X? Is the gain in speed even that large? If it’s worth saving for the bigger card then I’ll just have to be patient.
I’m currently running Torch7 and an LSTM-RNN with batches of text, not images, but if I want to do image learning I assume I’d want as much RAM as possible?
Cheers 🙂
The speedup should be about 4x when you go from a GTX 560 Ti to a GTX 980. The 4GB of RAM on the GTX 980 might be a bit restrictive for convolutional networks on large image datasets like ImageNet. A GTX Titan X or GTX 980 Ti will only be 50% faster than a GTX 980. If you wait about 14-18 months you can get a new Pascal card which should be at least 12x faster than your GTX 560 Ti. I personally would value getting additional experience now as more important than getting less experience now and faster training in the future — or in other words, I would go for the GTX 980.
How exactly would I be restricted by the 4GB of ram? Would I simply not be able to create a network with as many parameters, or would there be other negative effects (compared to the 6GB of the 980 Ti)?
You’ve mentioned in the past that bandwidth is the most important aspect of the cards, and the 980 Ti has 50% higher bandwidth than the regular 980; would that mean it’s 50% faster too, or are there other factors involved?
Yes, that’s correct: if your convolutional network has too many parameters it will not fit into your RAM. Other factors besides memory bandwidth only play a minor role, so indeed, it should be about 50% better performance (not the 33% I quoted earlier; I edited this for correctness just now).
Thank you so much for such informative article!
How would GTX Titan Z compare to GTX Titan X for the purpose of training a large CNN? Do you think it’s worth the money to buy a GTX Titan Z or is a GTX Titan X good enough? Thanks!
A GTX Titan X will be much better in most cases. If you want more details, have a look at my answer to this question on Quora.
I have been looking for an affordable CPU with 40 lanes without luck. Could you give me a link?
I am also curious about the actual performance benefit of 16x vs. 8x. If the bottleneck is the DMA writes, will the performance be reduced by half?
Hey Tim,
Thank you so much for this great writeup, it’s been pivotal in helping me and my co-founder understand the hardware. We’re a duo from MIT currently working on a venture backed startup bringing deep learning to education, hoping to help at least improve, if not fix, the US education system.
Our first build is aiming to be cheap where it can (since both of us are beginners and we need to be frugal with our funding) but future proof enough for us to do harder things.
My current build consists of these parts:
Mobo: Asus X99-E WS SSI CEB LGA2011-3 Motherboard
CPU: Intel Core i7-5820K 3.3GHz 6-Core Processor
Video Card: EVGA GeForce GTX 960 4GB SuperSC ACX 2.0+ Video Card
PSU: EVGA 850W 80+ Gold Certified Fully-Modular ATX Power Supply
RAM: Corsair Vengeance LPX 16GB (2 x 8GB) DDR4-3000
Storage: Sandisk SSD PLUS 240GB 2.5″ Solid State Drive
Case: Corsair Air 540 ATX Mid Tower Case
Could you look over these and offer any critique? My logic was to have a Mobo and CPU that could handle upgrading to better hardware later, things like the PSU, Ram, and the 960 I’m willing to replace later on.
Thank you in advance! Also is there a way we could exchange emails and chat more?
Would love any advice we can get from you while we build out our product.
Best,
Mike Xia
Looks good. The build is a bit more expensive due to the X99 board, but as you said, that way it will be upgradeable in the future which will be useful to ensure good speed of preprocessing the ever-growing datasets. You are welcome to send me an email. My email is firstname.lastname@gmail.com
What are your opinions on RAID setups in a deep learning rig? Software-based RAID is pretty crappy in my experience and can cause a lot more problems than it solves. However, RAID controllers take a PCIe slot, which will, fortunately/unfortunately, all be taken by 4x Gigabyte GTX 980 Ti cards. Is it worth running RAID with the software controller? Or is it better just to do full clone backups?
I do not think it is worth it. Usually, a common SATA SSD will be fast enough for most kinds of data; in some cases there will be a decrease in performance because the data takes too long to load, but compared to the effort and money spent on a (hardware) RAID system it is just not worth it.
Hi,
thanks a lot for all this information. After stumbling across a paper from Andrew Ng et al. (“Deep learning with COTS HPC systems”) my original plan was to also build a cluster (to learn how it is done). I wanted to go for two machines with a bunch of GTX Titans, but after reading your blog I settled on just one PC with two GTX 980s for the time being. My first thought after reading your blog was to actually settle for two 960s, but then I thought about the energy consumption you mentioned. Looking at the specifications of the NVIDIA cards, I figured the 980 was currently the most efficient choice (at least as long as you have to pay German energy prices).
As I am still relatively fresh to machine learning, I guess this setup will keep me busy enough for the next couple of months, probably until the Pascal architecture you mentioned is available (I read somewhere 2nd half of 2016). If not, then I guess I will buy another PC and move one of the 980s into it so that I can learn how to set up a cluster (my current goal is learning as much as possible as fast as possible).
The configuration I went for is as follows:
CPU: Intel i7-5930K (I chose this one instead of the much cheaper 5820K as it has the 40 PCIe lanes you mentioned, which gives the additional flexibility of handling 4 graphics cards)
Mainboard: ASRock Fatal1ty X99 Professional (supports up to 4 graphics cards and has a M.2 slot)
RAM: 4×8 GB DDR4-3000
Graphics Card: 2x Zotac GTX 980 AMP! Edition
Hard Disk: Samsung SSD SM951 with 256 GB (thanks to M.2 it offers 2 GB/s of sequential read performance)
Power Supply: be quiet! BN205 with 1200 Watts
I hope that installing Linux on the SSD works, as I read that the previous version of this SSD caused some problems.
Thanks again
Sascha
Hi Sascha! Your reasoning is solid and it seems you have a good plan for the future. Your build is good, but as you say, the PCIe SSD could be a bit problematic to set up. Another fact to be aware of is that your GPUs will have a slower connection with that SSD, because the SSD takes away bandwidth from your GPUs (your GPUs will run at 16x/8x instead of 16x/16x). Overall the PCIe SSD will be much faster for common applications, but slower when you use parallelism on two GPUs; in that case it might be better to go for a SATA SSD (if you do not use parallelism that much, a PCIe SSD is a solid choice). A SATA SSD will be slower than the PCIe one, but it should still be fast enough for any deep learning task. However, preprocessing will be slower on a SATA SSD, and faster preprocessing is probably the main advantage of the PCIe SSD.
That is an interesting point you make regarding the M.2. I did not realise that this is how the board distributes the lanes. I figured that, as the M.2 only uses 4 lanes, the two cards could each run with 16, and if I actually decided to scale up to a quad setup each card would eventually only get 8 lanes.
My first idea after reading the comment was to just try the SSD in the additional M.2 PCIe 2.0 slot, which is basically a SATA 6 connection, but that will not work, as it will not fit because one has the Key B and the other the Key M layout.
Do you have an idea what this actually means for real-life performance in deep learning tasks (like x% slower)?
Greetings
Sascha
When I think about it again, I might be wrong about what I just said. How two GPUs and the PCIe SSD work together depends highly on your motherboard, how the PCIe slots are wired, and how the PCIe switches are distributed. I think with a 40-PCIe-lane CPU and a mainboard that supports a 16x/16x/8x layout, it should be possible to use 16 lanes for each of your GPUs and 8 lanes for your SSD; to use that setup you only need to make sure to plug everything into the right slots (your mainboard manual should state how to do this). I have not looked at your hardware in detail, but I think your hardware supports this.
If your motherboard does not support 16x/16x/8x, then your GPU parallelism will suffer. Convolutional nets will have a penalty of 5-15% depending on the architecture; recurrent networks may have a low or no penalty (LSTMs) or a high penalty (20-50%) for recurrent nets with many parameters like vanilla RNNs.
Does anyone know what the requirements for prediction clusters would be? Most articles focus on training aspects, but inference/prediction is also important and the compute demands for it are little discussed. Can anyone comment on compute demands for prediction? Also, what do you recommend for such tasks: CPU only, CPU+GPU, or CPU+FPGA, etc.?
Thanks,
Vinay
It depends on many factors which is a suitable solution. If you build a web application, how long do you want your user to wait for a prediction (response time)? How many predictions are requested per second in total (throughput)?
Prediction is much faster than training, but still a forward pass of about 100 large images (or similar large input data) takes about 100 milliseconds on a GPU. A CPU could do that in a second or two.
If you predict one data point at a time a CPU will probably be faster than a GPU (convolution implementations relying on matrix multiplication are slow if the batch sizes are too small), so GPU processing is good if you need high throughput in busy environments, and a CPU for single predictions (1 image should take only about 100 milliseconds for a good CPU implementation). Multiple CPU servers might also be an option, and usually they are easier to maintain and cheaper (AWS spot instances, for example, are also useful for GPU work). Keep in mind that all these numbers are reasonable estimates only and will differ from the real results; results from a testing environment that simulates the real environment will make it clear whether CPU servers or GPU servers are optimal.
I do not recommend FPGA for such tasks since over time, interfaces to FPGA are not easy to maintain and cloud solutions do not exist (as far as I know).
I just want to thank you again Tim for the wonderful guide. I do have a couple of hardware utilization questions though. I am trying to figure out how to properly partition my space in ubuntu to handle my requirements. I dual boot Windows 10 (for work/school) and Ubuntu 14.04.3 (deep learning) with each having their own SSD boot drive and HDD storage drive. For starters here’s my setup:
– ASRock X99 WS-E
– 1x Gigabyte G1 980 ti
– 16GB Corsair Vengeance RAM 2133
– i7-5930k
– 2x Samsung 850 Pro 256GB SSDs (boot drives)
– 2x Seagate Barracuda 3TB HDDs (storage drives)
My Windows install is fine, but I want to be able to store currently unused data on the HDD, stage batches on the SSD, and then send the batches from the SSD to RAM to fully leverage the IOPS gain of an SSD.
I currently have Ubuntu partitioned this way; however, I’m not entirely sure this will fit my needs. I’m thinking I might want to put /home on the HDD due to how Ubuntu handles the /home directory in the UI, but I’m unsure if that will be a problem for deep learning:
SSD (boot):
– swap area – 16GB
– / – 20GB
– /home – 20GB
– /var – 10GB
– /boot – 512MB
– /tmp – 10GB
– /var/log – 10GB
HDD
– /store 1TB
Hello Tim,
Thank you for your article. The deep learning devbox (NVIDIA) has been touted as cutting edge for researchers in this area. Given your dual experience in both the hardware and algorithm sides, I would be grateful to hear your general thoughts on the devbox. I know it came out a few months after you wrote your article.
Thank you!
Tim, thanks again for such a great article.
One concern that I have is that I also use triple monitors for my work setup. However, doesn’t the fact that you’re using triple monitors affect the performance of your GPU? Do you recommend buying a cheap $50 GPU for the triple monitor setup and then dedicating your Titan X or your more expensive card primarily to deep learning? I run recurrent neural nets.
Thanks!
Three monitors will use up some additional memory (300-600MB) but should not affect your performance greatly (< 2% performance loss). I recommend getting a cheap GPU for your monitors only if you are short on memory.
Thanks — that makes a lot of sense. I just thought it would affect your bandwidth (as that is usually the bottleneck). I’m currently running the 980 Ti — I know it has 336 GB/s. Good to know that it uses some memory though. Appreciate it.
Hello Tim, what about external graphic cards connected through Thunderbolt? Have you looked at those? Could that be a cheap solution without having to build/buy a new system?
I looked at some performance reviews and they state about 70-90% performance for gaming. For deep learning the only performance bottleneck will be transfers from host to GPU, and from what I read the bandwidth is good (20 GB/s) but there is a latency problem. However, that latency problem should not be too significant for deep learning (unless it's a HUGE increase in latency, which is unlikely). So if I put these pieces of information together, it looks as if an external graphics card via Thunderbolt should be a good option if you have an Apple computer and have the money to spare for a suitable external adapter.
Hi Tim,
First, thanks a lot for these interesting and useful topics. I am a PhD student; I work on evolutionary ANNs.
I want to start using GPUs; my budget can reach $150 max.
I found in my town a new GTX 750 and a GTX 650 Ti. Which one is better, and are they supported by cuDNN?
Thanks
A GTX 750 should be better, and both support cuDNN. However, I would also suggest that you have a look at AWS GPU instances. The instance will be a bit faster and may suit your budget well.
Hi Tim..
Recently I have had a ton of trouble working with Ubuntu 14.04 … installing CUDA, Caffe etc. Ubuntu has password-locked me out of my system twice, and getting all the dependencies installed so that Caffe builds has been a real problem. It works sometimes … other times it doesn't. Ubuntu 14.04 is clearly an unstable OS.
I would like your opinion, Tim, on moving from Linux to Windows for deep learning. What are your thoughts?
Thanks in advance…
-Greg
I can feel your pain — I have been there too! Ubuntu 14.04 is certainly not intuitive when you are switching from Windows, and a simple, seemingly harmless command can ruin your installation. However, I found that once you understand how everything is connected in Linux, things get easier, make sense, and you no longer run into errors which break your installation or even your OS. After this point, programming in Linux will be much more comfortable than in Windows because it is so easy to compile and install any library. So it may be painful but it is well worth it. You will gain a lot if you go through the pain-mile — keep it up!
After Ubuntu 14.04 locked me out 3 times by booting to a false logon screen… I thought I'd try Ubuntu 15.04. I think the CUDA driver slammed Unity, resetting the root password to something other than the password I gave it. I searched the web and this is a common problem, and there seems to be no fix.
I'm running an X99 motherboard, an i7-5930, 64 GB RAM, and one Titan X. I'll get a second Titan X when I'm ready for it. I want to create my own NN and nodes, but for now I have a ton of learning to do and I need to follow what's been done so far.
Do you use standard libraries and algorithms like Caffe, Torch 7 and Theano via Python? I feel I need to wade through everything to see how it works before using it. Nvidia Digits looks pretty simple working from the GUI but it also looks, from my limited experience, like it’s pretty limited.
Is this because of your X99 board? I never had any problems like that. As for the software, Torch7 and Theano (Keras and derivatives) work just fine for me. I tried Caffe once and it worked, but I have also heard some nightmare stories about installing Caffe correctly. NVIDIA DIGITS will be just as you described: simple and fast, but if you want to do something more complex it will just be an expensive, fast PC with 4 GTX Titan X.
Just to tag onto this, I have an X99-E board and had some problems on the initial install when trying to boot into Ubuntu's live installer, though nothing with the password. After installing, everything worked fine at the OS level. In case this is relevant, reflashing to the latest BIOS helped a lot, but probably won't help your password problem.
Cheers and best of luck!
Mike
Yes, I did the BIOS flash in the beginning.
Lastly, I kept testing and found the culprit: when installing CUDA, I can't install the 502 driver that it comes with or the Ubuntu system locks with an unknown password, no matter how many different ways I try to install the CUDA driver. I scoured the internet for a solution and there wasn't one, and it looks like no one has put two and two together about the CUDA driver. It could be a combination of things, both hardware and software, but it definitely involves this driver, the X99 motherboard, a Titan X, and Ubuntu 14.04 and 15.04.
Thanks.
Hi Tim, The company that I buy my servers from (Thinkmate) recently sent me an e-mail advertising that they’ve been working with Supermicro to sell servers with support for Titan X. What do you think about this solution? I’ve had a lot of luck with Supermicro servers, and they offer 3 year warranty on the Titans and will match the price if found cheaper elsewhere. Here’s the link: http://www.thinkmate.com/systems/servers/gpx/gtx-titan-x
Hi Brent, I think in terms of the price, you could definitely do better on the 1U model with 4 GTX Titan X. A normal board with 1 CPU will not have any disadvantage compared to the 1U model for deep learning.
However, the 4U model is different because it can use 8 GTX Titan X with a fast CPU-to-CPU switch, which makes parallelization across 8 GPUs easy and fast. There are only a few solutions available that are built like this and come with 8 GTX Titan X — so while the price is high, this is a rather unique and good solution.
Hi Tim,
Thank you very much for all the writing. I am an Objective-C developer but a complete newbie to the deep learning thing, and very interested in this area right now.
I have a Mac 3.1 and I would like to upgrade the graphics card to get CUDA so I can run Torch7, Lua and nn to learn this kind of programming. I don't mind whether it is a Mac card or a Windows card.
Which one would you recommend? GTX 780 Ti? GTX 960 2GB? GTX 980? Tesla M2090 (second hand)?
Look forward to your advice.
Of the cards you listed, the GTX 980 will be the best by far. Please also have a look at my GPU guide for more info on how to choose your GPU.
Thank you very much. I got a generous sponsor to build a new Ubuntu machine with 2 GTX 780 Ti. Should I use the GTX 980 in the new machine to yield better performance than the two GTX 780 Ti in SLI, or let it stay in my Mac?
If you already have the two GTX 780 Ti I would stick with that and only change/add the GPU if you experience RAM shortage for one of your models.
Thank you very much Tim. I am looking forward to your further writing.
By the way, did you have time to look at the neurosynaptic chip from IBM yet? I'm really interested in your "deep analysis" on this as well.
Hey Tim…
Do you have any suggestions for a tutorial for DL using Torch7 and Theano and/or Keras?
Thanks
Greg
Hi Tim,
Great post; In general all of the content on your blog has been fantastic.
I'm a little curious about your thoughts on other types of hardware for use in deep learning. I've heard a number of people suggest FPGAs as potentially useful for deep learning (and parallel processing in general) due to their memory efficiency vs. GPUs. This is often mentioned in the context of Xeon Phi… what are your thoughts on this? If true, where does the usefulness lie, in the 'training' or 'scoring' part of deep learning (my perhaps incorrect understanding was that GPUs' advantage was their use for training as opposed to scoring)?
My apologies for what I’m certain are sophomoric questions; I’m trying to wrap my head around these matters as someone new to the subject!
Regards,
BK
Nonsense, these are great questions! Keep them coming!
FPGAs could be quite useful for embedded devices, but I do not believe they will replace GPUs. This is because (1) their individual performance is still worse than an individual GPU and (2) combining them into sets of multiple FPGAs yields poor performance while GPUs provide very efficient interfaces (especially with NVLink which will be available at the end of 2016). GPUs will make a very big jump in 2016 (3D memory) and I do not think FPGAs will ever catch up from there.
Xeon Phi is potentially more powerful than GPUs because it is easier to optimize at the low level. However, it lacks the software for efficient deep learning (just like AMD cards), and as such it is unlikely that one will see Xeon Phis used for deep learning in the future (unless Intel creates a huge deep learning initiative that rivals NVIDIA's).
Thanks for the response! That’s very interesting.
I wanted to follow up a little bit regarding software development for NVIDIA vs. Intel or AMD. I know how much more developed the CUDA libraries are for deep learning than OpenCL. What frameworks can I actually run on an Intel or AMD architecture? Do Torch/Caffe/Theano only work on NVIDIA hardware? Once again, my apologies if I'm fundamentally misunderstanding something.
One last question: beyond the world of deep learning, what is the perception of Xeon Phi? It seems hard to find people who can say with certainty what its strengths/applications will be. Is there any consensus on this? What do you think makes the most sense as an application for Xeon Phi?
Many thanks!
-BK
Tim,
Thank you for the many detailed posts. I am going with a one GPU Titan X water cooled solution based on information here. Does it still hold true that adding a second GPU will allow me to run a second algorithm but that it will not increase performance if only one algorithm is running? Best Regards – Eric
There are now many good libraries which provide good speedups for multiple GPUs. Torch7 is probably the best of them. Look for the Torch7 Facebook extensions and you should be set.
Hello! First off, I just want to say this website is a great initiative!
I'm going to use Kaldi for speech recognition next spring in my master's thesis. Not knowing exactly what type of DNNs I'll be implementing, I'm planning for an all-round solid budget GPU. Is the GTX 950 with 2 GB suitable (I haven't seen it mentioned here)? It only requires a 350 W PSU, which is why I'm considering it. Also, I have a Q6600 CPU and a motherboard that takes 4 GB RAM at most, so this is a bit constraining for the overall performance of this setup. And apologies if this is too general a question. I'm just now getting into the field 🙂
The GTX 950 2GB variant might be a bit short on RAM for speech recognition if you use more powerful models like LSTMs. The cheapest solution might be to prototype on your CPU and use AWS GPU instances to run the model if everything looks good. This way you need no new computer/PSU and will be able to run large LSTMs and other models. If this does not suit you, a GTX 950 with 4GB of memory might be a good choice.
Just a quick note to say thank you and congrats for this great article.
Very nice of you to share your experience on the matter.
Regards.
Alex
Thank you! I am happy that you found the article helpful!
Hey Tim,
Thanks for the great article; I have a more specific question though. I'm building an entry-level Kaggle-worthy system using an i7-5820K processor. Since I want to keep my GTX 960's 4GB memory solely for deep learning, would you recommend I buy an additional (cheaper) graphics card for the display or not? I'm considering the GT 610 for this purpose since it's cheap enough. Also, if I were to do this, where would I specify such a setting (e.g. use the GT 610 for display)?
Thanks again!
Rohit
For most datasets on Kaggle your GPU memory should be okay, and using another small GPU for your monitors will not do much. However, if you are doing one of the deep learning competitions, you find yourself short on memory, and you think you could improve your score by using a slightly larger model, then this might be worth it. So I would only consider this option if you really encounter problems where you are short on memory.
Also remember that the memory requirements of convolutional nets increase most quickly with the batch size, so going from a batch size of 128 to 96 or something similar might also solve memory problems (although this might also decrease your accuracy a bit; it all depends on the data set and problem). Another option would be to use the Nervana Systems deep learning libraries, which can run models in 16-bit, thus halving the memory footprint.
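To make the batch-size effect concrete, here is a rough sketch of how activation memory scales. The layer shapes are made up for illustration, and real memory use also depends on the framework and on which cuDNN algorithms it selects.

```python
# Very rough activation-memory estimate for a small convolutional net.
# The layer shapes below are illustrative only; the point is that activation
# memory scales linearly with the batch size and halves with 16-bit storage.

layers = [            # (channels, height, width) of the activations per sample
    (64, 112, 112),
    (128, 56, 56),
    (256, 28, 28),
    (512, 14, 14),
]

def activation_mb(batch_size, bytes_per_value=4):
    values = sum(c * h * w for c, h, w in layers) * batch_size
    return values * bytes_per_value / 1024.0**2

for bs in (128, 96):
    print("batch %3d: ~%4.0f MB (32-bit), ~%4.0f MB (16-bit)"
          % (bs, activation_mb(bs, 4), activation_mb(bs, 2)))
```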
Tim,
First of all, thank you for writing this! This post has been extremely helpful to me.
I'm thinking about getting a GTX 970 now and upgrading to Pascal when it comes out. So, if I never use more than 3.5 GB of VRAM at a time, then I won't see performance hits, correct? I'm building my rig for deep reinforcement learning (mostly Atari right now), so my mini-batches are small (<2MB), and so are my convnets (<2 million weights). Should I be fine until Pascal?
I'm trying to decide between these two budget builds: [Intel Xeon e5](http://pcpartpicker.com/p/dXbXjX) and [Intel i5](http://pcpartpicker.com/p/ktnHdC). I'm thinking about going with the Xeon, since it has all 40 PCIe lanes if I ever wanted to do more than two GPUs in the future, and it's a beefier processor. However, I start grad school in the fall, so I'd have university hardware then, and I think I'd be more than fine with two GPUs for personal experiments in the future. (Or could 4 lanes be enough bandwidth for a GPU?) If I get the i5, I could upgrade the processor without having to upgrade the motherboard if I wanted. The processor just needs to be good enough to run (Atari) emulations and preprocess images right now. I can't really imagine anything but the GPU being the bottleneck, right?
Thank you for the help. I'm trying to figure out something that will last me awhile, and I'm not very familiar with hardware yet.
Thanks again,
– JB
Hi JB,
the GTX 970 will perform normally if you stay below 3.5GB of memory. Since your mini-batches are small and you seem to have rather few weights, this should fit quite well into that memory. So in your case the GTX 970 should give you optimal cost/performance.
Hi Tim:
Thanks so much for sharing your knowledge!
I've seen you mention that Ubuntu is a good OS.
What is the best OS for deep learning?
What is a good alternative to Ubuntu?
I’d really appreciate your thoughts on this…
Linux-based systems are currently best for deep learning since all major deep learning software frameworks support Linux. Another advantage is that you will be able to compile almost anything without problems, while on other systems (Mac OS, Windows) there will always be some issues, or it may be nearly impossible to configure the system well.
Ubuntu is good because it is widely used, easy to install and configure, and its LTS versions have long-term support, which makes it attractive for software developers who target Linux systems. If you do not like Ubuntu you can use Kubuntu or other *buntu variants; if you like a clean slate and want to configure everything the way you like it, I recommend Arch Linux, but beware that it will take a while until you have configured everything to suit you.
Hi Tim,
Great website! I am building a devbox: https://developer.nvidia.com/devbox.
My machine has 4 Titan X cards, a Kingston Digital HyperX Predator 480 GB PCIe Gen2 x4, an Intel Core i7-5930K Haswell-E, and 64GB of G.SKILL RAM. I am using an ASUS Rampage V Extreme motherboard. When I place the last Titan X card in the last slot, my SSD disappears from the BIOS. I am not sure if I have a PCIe conflict? Can the M.2 interfere with PCIE_X8_4? What should I do to fix this issue? Should I change the motherboard? Any advice?
Your motherboard only supports 40 PCIe lanes, which is standard, because CPUs only support a maximum of 40 PCIe lanes. Your 4 Titan X will run in 16x/8x/8x/8x lane mode. You might be able to switch the first GPU to 8x manually, but even then CPUs and motherboards usually do not support an 8x/8x/8x/8x/8x mode (usually two PCIe switches are supported for a single CPU, and a single PCIe switch supports two devices, so you can only run 4 PCIe devices in total). This means that there is probably no way to get your PCIe SSD working alongside 4 GPUs. I might be wrong. To check this it is best to contact ASUS tech support and ask them whether the configuration is possible or not.
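A quick lane count makes the conflict visible. The GPU allocation below is the 16x/8x/8x/8x mode mentioned above, and the x4 figure is taken from the drive's "PCIe Gen2 x4" specification.

```python
# PCIe lane budget on a 40-lane CPU such as the i7-5930K.
cpu_lanes = 40                # lanes provided by the CPU
gpus = [16, 8, 8, 8]          # 16x/8x/8x/8x mode with four Titan X
m2_ssd = 4                    # HyperX Predator M.2 drive: "PCIe Gen2 x4"

print("GPUs alone:     %d of %d lanes" % (sum(gpus), cpu_lanes))           # 40 of 40
print("GPUs + M.2 SSD: %d of %d lanes" % (sum(gpus) + m2_ssd, cpu_lanes))  # 44 of 40 -> conflict
```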
Hi Tim,
Thank you for the wonderful guide.
Like Lawrence, I'm also building a GPU workstation using https://developer.nvidia.com/devbox as the guide. It mentions a "512GB PCI-E M.2 SSD cache for RAID". I wonder how to set up this SSD as the cache for RAID, since RAID 5 does not support this as far as I know. Have you done anything similar? Thank you very much.
Hi Bobby,
I have no experience with RAID 5, since usual datasets will not benefit from increased read speeds as long as you have an SSD. I think you will need to change some things in the BIOS and then set up a few things for your operating system with a RAID manager. I think you will be able to find a tutorial for your OS online so you can get it running.
Hi Tim,
It seems it's not related to the RAID. I wonder how to set up an SSD as the cache for a normal HDD; setting it as the cache for RAID should be similar. With this, I may not need to manually copy my dataset from HDD to SSD before an experiment. Thank you.
Hey Tim,
first of all, thank you very much for your great article. It helped me a lot to gain some insight into the hardware requirements of a DL machine. Over the past several years I only worked with laptops (in my free time), as I had some good machines at work. Now I am planning to set up a system at home to start experimenting with some stuff in my free time. After I read your post and many of the comments I started to create a build (http://de.pcpartpicker.com/p/gdNRQ7), and since you have looked over so many systems and given advice, I hoped that you could maybe do it once again 😉
I chose the 970 as a starter and will then wait for the Pascal cards coming out later this year. I am also not planning to work with more than 2 GPUs at home in the future. As for the monitor, I already have one 24″ at home, so this will just be the second.
I dunno, maybe you can look over it and give me some advice or your opinion.
Looks like a solid build for a GTX 970 and also after an upgrade to one or two Pascals this is looking very good.
Thanks for the time you are spending giving so many people advice. It is/was quite hard for me after so many years of laptop use to dive back into hardware specifics. You made it a lot easier with your post. Big thanks again!
Nice article!
What do you think about HBM? Apart from the size of the RAM, do you think that the Fury X has any advantage compared to the 980 Ti?
The Fury X definitely has the edge over the GTX 980 Ti in terms of hardware, though in terms of software AMD still lags behind. This will change quite dramatically once NVIDIA Pascal hits the market in a few months. HBM is definitely the way to go to get better performance. However, NVIDIA's HBM offers double the memory bandwidth of the Fury X, and Pascal will also allow for 16-bit computations, which effectively doubles the performance further. So I would not recommend getting a Fury X, but instead waiting for Pascal.
How soon do you think the Pascal flagship, like a Titan X, will be on the market? I am not sure if I should wait. Thank you.
Hi Tim — Thanks for this article, I’ve found it extremely useful (as have others, clearly).
You’re probably aware of this, but the new Titan X Pascal cards have very weak FP16 performance.
Yes the FP16 performance is disappointing. I was hoping for more, but I guess we have to wait until Volta is released next year.
Thank you. But considering the size of the memory and the brand, I am afraid the price of Pascal will be far beyond my budget.
Hi Tim
Thanks a lot for your article. It answered some of my questions. I am actually new to deep learning and know almost nothing about GPUs. But I have realized that I need one. Can you comment on the expected speedup if I run convnets on a Titan X rather than an Intel Core i7-4770 at 3.4 GHz?
Even a vague figure would do the job.
Best Regards
Wajahat
It depends highly on the kind of convnet you want to train, but a speedup of 5-15x is reasonable. However, if you can wait a bit, I recommend waiting for the Pascal cards, which should hit the market in two months or so.
Hi Tim,
Thanks for this excellent primer. I am trying to put a parts list together and have this so far (http://pcpartpicker.com/p/JnC8WZ), but it has 2 incompatibility issues. Basically, I want to work through the 2nd Data Science Bowl (https://www.kaggle.com/c/second-annual-data-science-bowl) as an exercise. I will likely work with a lot of medical image data. Also, I will use this system as an all-purpose computer too (for medical writing), so I'm wondering if I also need to add USB, HDMI, and DVI connections (I currently also use an Eizo ColorEdge CG222W monitor). Also, I like the idea of 2 hard drives, one for Windows and one for Linux/Ubuntu (or I could partition?). Finally, I use a wireless connection, hence that choice. I would be most grateful if you could help with the 2 incompatibilities and any omissions, and check whether this system would generally be okay. Thank you in advance for your time.
You can resolve the compatibility issue by choosing a larger mainboard. A larger mainboard should give you better RAM voltage and also fix the PCIe issue. Although the GTX 680 might be a bit limiting for training state-of-the-art models, it is still a good choice to learn on the Data Science Bowl dataset. Once Pascal hits the market you can easily upgrade and will be able to train all state-of-the-art networks easily and quickly.
Thank you for this response. I had the GTX 980 selected (in the pcpartpicker permalink), but I may well just wait for the Pascal that you suggested. I read this article (http://techfrag.com/2016/03/18/nvidia-pascal-geforce-x80-x80ti-gp104-gpu-supports-only-gddr5-memory/), however, and suppose I must admit I’m quite confused with the names, the relationship of “Pascal” to GeForce X80, X80Ti & Titan Specs, and also the concern with respect to GDDR5 vs. GDDR5X memory. Is it worth it to wait for one of the GeForce (which I assume is the same as Pascal?) rather than just moving forward with the GTX 980? Will one save money by way of sacrificing something with respect to memory? Please forgive my neophyte nature with respect to systems.
Pascal will be the new chip from NVIDIA, which will be released in a few months. It should be designated GTX 10xx. The xx80 refers to the most powerful consumer GPU model of a given series, e.g. the GTX 980 is the most powerful of the 900 series. The GTX Titan is usually the model for professionals (deep learning, computer graphics for industry and so forth).
And yes I would wait for Pascal rather than buy a GTX 980. You could buy a cheap small card and sell it once Pascal hits the market.
You say the GTX 680 is appropriate for convnets; however, I see the GTX 680 has just 2GB of RAM, which is inadequate for most convnets such as AlexNet and of course the VGG variants.
There is also a 4GB GTX 680 variant which is quite okay. Of course a GTX 980 with 6GB would be better, but it is also way more expensive. However, I would recommend one GTX 980 over multiple GTX 680. It is just not worth the trouble to parallelize on these rather slow cards.
“CPU and PCI-Express. It’s a trap!”
I have no idea what that is supposed to mean. Does it mean I should avoid PCI Express? Or just certain Haswells? What is the point here?
Certain Haswells do not support the full 40 PCIe lanes. So if you buy a Haswell, make sure it supports all 40 lanes if you want to run multiple GPUs.
Why is the AWS g2.8x not enough?
It says 60 GiB (approx. 64 GB) of GPU memory.
Thanks
The 60GB refers to the CPU memory that the AWS g2.8x has. The GPU memory is 4GB per card.
Hi Tim,
Thanks for the great post. I am a graduate student and would like to put together a machine soon. If I build a system with an i7-5930K CPU, an Asus X99 Deluxe motherboard and two Titan X GPUs for now, will the Pascal GPUs be compatible with this configuration? Can I simply plug in a Pascal GPU when it is released? Thanks a lot.
As far as I understand there will be two different versions of the NVLink interface, one for regular cards and one for workstations. I think you should be alright with your hardware, although you might want to wait for a bit since Pascal will be announced soon and probably ship in May/June.
Thanks for the great answers. Do you think that one Titan with 12 GB of memory is better than, say, two GTX 980s, or two of the upcoming Pascals (xx80s)? I currently have a system designed with a motherboard that has the additional PCIe lanes, but (as I've been told by the Puget Systems people) adding a second GPU would slow things down by 2x. So I thought "just get the Titan with 12 GB of memory and be done with it." Do you think that sounds okay? Or do I upgrade the motherboard? I'm thinking that the Titan may be more than I ever need, but unfortunately I do not know. Thank you for your great help and thorough work.
Hey Tim,
Awesome article. Was curious whether you have an opinion on the Tesla M40 as well.
Looks suspiciously similar to the Titan X.
Think the “best DL acceleration” claim might be a bit of a marketing gamble?
Cheers,
–Razvan
This post is getting slowly outdated and I did not review the M40 yet — I will update this post next week when Pascal is released.
To answer your question, the Titan X is still a bit faster with 336 GB/s while the M40 sports 288 GB/s. But the M40 has much more memory which is nice. But both cards will be quite slow compared to the upcoming Pascal.
Wow, I am super glad I read this response. Based on your comment about the Pascal vs. the Titan X, I was able to place the development of my system on hold, just in time! I was going to get a Titan X. But now I will want to know if it will be much better to get the Pascal with 32 GB of dedicated RAM (VRAM?) vs. the 12 GB of the Titan X. http://www.pcworld.com/article/2898175/nvidias-next-gen-pascal-gpu-will-offer-10x-the-performance-of-titan-x-8-way-sli.html
Do you have specific information that suggests it will be only one week before the Pascals are available? How much do you think the 1080 will cost (in USD, euros, etc.)?
The Pascal P100 won’t even be available to most of us until later this year at the soonest (http://wccftech.com/nvidia-pascal-gpu-gtc-2016/) and it isn’t even in the same league as the Titan X. They haven’t said anything about the 10xx’s, so I’m assuming they will be quite a while yet also?
Hi Tim,
Thanks for the post! Very helpful. I was just wondering which monitor (the one in the center) you used in the picture showing the three monitors?
That is an AOC E2795VH. Unfortunately they are not sold anymore. But I think any monitor with a good rating will do.
Hi,
Thanks for this post. Are there any Cloud solutions yet?
I used Amazon g2.2xlarge as well as g2.8xlarge as Spot Instances,
however, the GPUs are old, don’t support the latest CUDA features and spot prices
have increased.
There are also some smaller providers for GPUs, but their prices are usually a bit higher. Newer GPUs will also be available via the Microsoft Azure N-series sometime soon, and these instances will provide access to high-end GPUs (M60 and K80). I will look into this issue next week when I update my GPU blog post.
Can you recommend a good box which supports:
1. multiple GPUs for deep learning (say the new Nvidia GP100),
2. additional GPU for VR headset,
3. additional GPU for large monitor?
Thanks!
Everyone seems to be using an Intel CPU, but they seem prohibitively expensive if actual clock speed or cache isn't that important… Would an AMD CPU with 38-lane support work just as well paired with two GPUs?
Also, have you experimented with builds using two different GPUs?
Yes, an AMD CPU should work just as well with 2 GPUs as an Intel one. However, using two different GPUs will not work if they have different chipsets (GTX 980 + GTX 970 will not work); what will work is having different vendors (EVGA GTX 980 + ASUS GTX 980 will work with no problems).
I see – thanks! I'm considering just getting a cheaper GPU to at least get my build started and running, and then picking up a Pascal GPU later. My plan was to use the cheaper GPU to drive a few monitors and use the Pascal card for deep learning. That kind of setup should be fine, right? In other words, there is only an issue with two different cards if I try to use them both in training, but I'm essentially using just a single GPU for it.
Hi Tim,
This post was amazingly useful for me. I’ve never built a machine before and this feels very much like jumping in the deep end. There are two things I’m still wondering about:
1. If I’m using my GPU(s) for deep learning, can I still run my monitor off of them? If not should I get some (relatively) cheap graphics card to run the monitor, or do something else?
2. Do you have any opinion about Intel’s i7-4820K CPU vs. the i7-5820K CPU? There seems to be a speed vs. cache size & cores trade-off here. My impression is that whatever difference there is will be small, but the larger cache size should lead to fewer cache misses, which should be better. Is this accurate?
Thanks
Was just reading through the Q/A’s here and saw your response to Rohit Mundra (2015-12-22) answered my first question.
Sorry for the repeat….
No problem, I am glad you made the effort to find the answer in the comment section. Thanks!
My guess is that (if done right) the monitor functionality gets relegated to the integrated graphics capability of the motherboard. Just don’t try to stream high-res. video while training an algorithm.
Ooops – I should have mentioned that the motherboard I’m using is an ASRock Fatal1ty X99 Professional/3.1 EATX LGA2011-3. It doesn’t have an integrated graphics chip.
Hi Tim, THANKS for such a great post! and all these responses!
I got a question:
What if I buy a TX1 instead of buying a computer?
I will do video or CNN image classification sorts of things.
Cheers,
Dorje
Hi Dorje,
I also thought about buying a TX1 instead of a new laptop, but then I opted against it. The overall performance of the TX1 is great for a small, mobile, embedded device, but not so great compared to desktop GPUs or even laptop GPUs. There might also be issues if you want to install new hardware, because it might not be supported by the Ubuntu for Tegra OS. I think in the end the money is better spent on a small, cheap laptop plus some credit for GPU instances on AWS. Soon there will also be high-performance instances (featuring the new Pascal P100), so this would also be a good choice for the future.
Hi Tim,
Thanks for the great post. Sorry to bother you again. I just want to ask something about the coolbits option for GPU cards. Right now I have set it to 12 and I can manually control the fan speed. It works nicely. But I won't be checking the temperature all the time and changing the fan speed accordingly. So during training, what fan speed should I use? 50%, 80%, or an aggressive 90% maybe? Thanks a lot.
And if I keep the fan always running at 80% speed, will it reduce the lifespan of the card? Thanks.
The life expectancy of the card increases the cooler you keep it. So if you can, you can keep the fan at 100% at all times. However, this can of course cause noise problems if the machine is near you or other people. For my desktop I keep the fan as low as possible while keeping the GPU below 80 degrees C, and if I leave the room I just set the fan speed to 100%.
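If you do not want to watch the temperature yourself, a small script can do the adjustment for you. This is only a sketch: it assumes coolbits are enabled as discussed, that an X server is running, and that your driver exposes the GPUTargetFanSpeed attribute (the attribute name differs between driver versions).

```python
# Minimal fan-control loop for GPU 0: poll the temperature with nvidia-smi
# and set the fan speed with nvidia-settings. Assumes coolbits are enabled,
# an X server is running, and the driver exposes GPUTargetFanSpeed.
import subprocess
import time

def gpu_temp():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=temperature.gpu",
         "--format=csv,noheader,nounits"])
    return int(out.decode().split()[0])   # temperature of the first GPU in degrees C

def set_fan(percent):
    subprocess.call(
        ["nvidia-settings",
         "-a", "[gpu:0]/GPUFanControlState=1",
         "-a", "[fan:0]/GPUTargetFanSpeed=%d" % percent])

while True:                               # run until you stop it (Ctrl+C)
    set_fan(90 if gpu_temp() >= 75 else 60)
    time.sleep(30)
```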
Thanks a lot for your reply, it helps a lot.
Keep in mind that running your fans at 100% constantly will wear them out much faster – although that is better than a dead GPU chip. It can be difficult to find cheap replacement fans for some GPUs, so you should look for cheap ones on Alibaba etc. and have a few spares lying around in advance, since shipping from China takes weeks.
Also, when a fan stops running smoothly, you can usually just buy cheap "ball bearing oil" ($4 on eBay or so) and remove the sticker on the front side of the fan. There will be some tiny holes beneath into which you can simply squirt some of the oil, and most likely the fan will run as good as new. It has worked out for me so far.
Thanks for the great blog, i learned a lot.
For me, getting a 40-lane or even 28-lane CPU and motherboard is out of budget. In my country these parts are rare.
I am planning to get a 16-lane CPU. With this I can get a motherboard which has 2x PCIe 3.0 x16. I plan to use a single GPU initially. If I want to use 2 GPUs it has to be an x8/x8 configuration. With this configuration, is it practical to use 2 GPUs in the future?
My system will likely have i7 6700, Asus Z170-A and Titan X.
Cheers,
RK
Hi RK,
16 lanes should still work well with 2 GPUs (but make sure the CPU supports x8/x8 lanes — I think every CPU does, but I never used them myself). The transfer to the GPU will be slower, but the computation on the GPU should still be as fast. You will probably see a performance drop of 0-5% depending on the data that you have.
Thanks for the fast reply.
You are welcome 🙂
Hi, I am a Brazilian student, so everything is way too expensive for me. I will buy a GTX 960, start off with a single GPU and expand later on. The problem is that Intel CPUs with 30+ lanes are WAY too expensive. So I HAVE to go with AMD, but the motherboards for AMD only have PCIe 2.0.
My question is: can I get good performance out of 2x 960 GPUs on a PCIe 2.0 x16 mobo? By good I mean equal to a single 960 at x16 on PCIe 3.0, maybe even a single GTX 980.
Hi, both an Intel CPU with 16 lanes or less (as long as your motherboard supports 2 GPUs) and an AMD CPU with PCIe 2.0 will be fine. You will not see large decreases in performance. It should be about 0-10% depending on the task and the deep learning software.
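The 0-10% figure becomes plausible if you estimate the transfer time of a typical mini-batch. The batch shape and bandwidth numbers below are approximations for illustration only.

```python
# Rough transfer-time estimate for one mini-batch over PCIe.
batch_bytes = 128 * 3 * 224 * 224 * 4    # 128 RGB 224x224 images as float32, ~77 MB
bandwidths = [
    ("PCIe 3.0 x16", 15.75e9),           # approx. usable bytes/s
    ("PCIe 2.0 x16", 8.0e9),             # approx. usable bytes/s
]

for name, bw in bandwidths:
    print("%s: %.1f ms per batch" % (name, batch_bytes / bw * 1e3))
# Both are a few milliseconds, while the forward/backward pass for a large
# convnet on such a batch is on the order of 100 ms or more, so the slower
# bus adds only a small relative overhead.
```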
If you are short on money it might also be an option to use AWS GPU instances. If you do not train every day this might be cheaper in the end. However, for tinkering around with deep learning a GTX 960 will be a pretty solid option.
Thank you very much, Tim.
I got a Titan X, hahaha~
Cheers,
Dorje
Hi Tim, great post!
Could you talk a bit about having different graphics cards in the same computer? As an extreme example, would having a Titan X, 980 Ti and a 960 be problematic?
Tim,
Any updates to your recommendations based on Skylake processors and especially Quadro GPUs?
Skylake is not needed and Quadro cards are too expensive — so no changes to any of my recommendations.
So reading in this post that bandwidth is the key limiter makes me think the GTX 1080, with a bandwidth of 320 GB/s, will be slightly worse for deep learning than a 980 Ti. Does that sound right?
You cannot compare the bandwidth of a GTX 980 with the bandwidth of a GTX 1080 because the two cards use different chipsets. The GTX 1080 will definitely be faster.
Hi, does the number of CUDA cores matter? The GTX 1080 is about to be released and it has 2500 CUDA cores, whereas a GTX 980 Ti has about 2800 CUDA cores. Will this affect the speed of training? Or will the GTX 1080 in general be faster with its 8 teraflops of performance?
The number of cores does not really matter. It all depends on how these cores are integrated into the GPU. The GTX 1080 will be much faster than the GTX Titan X, but it is hard to say by how much.
So you'd recommend that I invest in a GTX 1080 instead? 🙂
Hi Tim. Thanks for an excellent guide! I was wondering what your opinion is on Nvidia’s new graphics card – Nvidia Geforce GTX 1080. The performance is said to beat the Titan X and is proposed to be half the price!
Hi Tim,
I suppose this is echoing Jeremy’s question, but is there any reason to prefer a Titan X to a GTX 1080 or 1070? The only spec where the Titan X still seems to perform better is in memory (12 GB vs. 8 GB).
I got a Titan X on Amazon about 2.5 weeks ago, so have about 10 days to return it for a full refund and try for a GTX 1080 or 1070. Is there any reason not to do this?
No deep learning performance data is currently available for the GTX 1000 series, but it is rather safe to say that these cards will yield much better performance. If you use 16-bit, and probably most libraries will switch to that soon, you will see an increase of at least 2x in performance. I think returning your Titan X is a good idea.
Just wanted to add that NVIDIA artificially crippled 16-bit operations on the GTX 1070/1080 to abysmal speeds, so we can only hope they don't do the same with the Pascal Titan card.
Hello Tim,
Comparing two cards for GPGPU (deep learning being an instance of GPGPU), what is more important: the number of cores or the memory? For learning purposes and maybe some model development I am considering a low-end card (512 cores, 2GB)… will this seriously cripple me? Other than giving up performance gains, will it seriously be constraining? I checked the research work of folks from 5+ years ago and many in academia used processors with even weaker specs and still got something done. Once I discover that I am doing something really serious I can go to the Amazon cloud, get an external GPU (connected via Thunderbolt 3), or build a machine.
Neither cores nor memory is important per se. Cores do not really matter. Bandwidth is most important, with FLOPS second most important. You do need a certain amount of memory to train certain networks; for state-of-the-art models you should have more than 6GB of memory.
Hi Tim, did you connect your 3 monitors to the mainboard/CPU or to your GPU? Does this have an influence on the deep learning computation?
I connected them to two GPUs. It does not really affect performance (maybe 1-3% at most), but it does take up some memory (200-500MB). Overall this effect is negligible.
Hey Tim…quick question. Do you have any opinion about the new GeForce GTX 1080s for deep learning?
Maybe you already gave your opinion and I missed it.
Thanks,
Greg
Tim,
I'm looking for information on which GPU cards have support for convolutional layers. In particular I was considering a laptop with the GTX 970, but according to your blog above it does not support convolutional nets. Would you mind explaining what that means in terms of features and also time performance? Is there a way to tell from the specs whether a card is good for convnets?
thanks in advance
Maybe I have been a bit unclear in my post. The GTX 970 supports convolutional nets just fine, but if you use more than 3.5GB of memory you will be slowed down. If you use 16-bit networks, though, you can still train reasonably sized networks. So a GTX 970 is okay for most non-research, non-I-want-to-get-into-the-Kaggle-top-5 use cases.
Question: for budgetary reasons I'm looking at an AMD CPU/board combination (4 cores), but that combination has no onboard video.
Can the GPU (4GB NVIDIA 960) which will be used for machine learning also be used at the same time as the video card (no 3D, of course)?
Does that work or do I need an extra video card? Thanks!
Yes, that will work just fine! This setup would be a great setup to get started with deep learning and get a feel for it.
This is the most informative blog about building a deep learning machine!
Thanks for that.
Now that NVIDIA's 1080 and 1070 are launched, which is the better deal for us?
Two 1070s or one 1080?
Everyone writes in the context of gamers 🙁
I badly need this community's voice here!
I have a laptop with an NVIDIA Quadro M3000M (4.0GB, GDDR5, PCI-Express). I would like to use it for deep learning. I noticed that no one mentions Quadro cards in the context of deep learning; is there a design reason why these cards are not used for deep learning?
PS: I tried to install Ubuntu (all its versions) and it fails to show the GNOME menu; it just shows the desktop background image.
As far as I know, Quadro cards are usually optimized for CAD applications. You can use them for deep learning, but they will not be as cost-efficient as regular GeForce cards.
Your problem with Ubuntu not booting is a strange one; it does not really look like a graphics driver issue since you get a screen. Before googling for more difficult troubleshooting procedures, I would try other Ubuntu 14.04 LTS flavours if I were you, like Xubuntu (Windows-like, lightweight), Kubuntu (Windows-like, fancy) or even Lubuntu (very lightweight). It may just be some arcane issue with Ubuntu's GNOME desktop and your hardware.
Thanks so much for your advice! I managed to install Xubuntu 16.04; now the next step is installing CUDA and TensorFlow. I will need all the advice I can get with that one.
The problem I have with the Ubuntu desktop is a known one; it looks like they are going to address it in 16.04.1 (sorry for the slightly off-topic comment).
http://askubuntu.com/questions/760051/ubuntu-16-04-0-final-unity-desktop-kubuntu-gnome-can-not-boot-from-live-us/760124
You should try to use 14.04; 16.04 can still give you lots of headaches right now.
This is how I do it: http://pastebin.com/E6uFu2Em
This will not work on 16.04 for probably hundreds of reasons.
For deep learning on speech recognition, what do you think of the following specs?
It's going to cost 2928 USD. What are your thoughts on this?
– INTEL CORE I7-6800K UNLOCKED FOR OC(28lanes)(6 CORE/ 12 THREADS/3.8GHZ) NEW!
– XSPC RayStorm D5 Photon AX240 (240L)
– ASUS X99-E WS (ATX/4way SLI/8x Sata3/2xGigabit LAN/10xUSB3.0)
– 4 x GSKILL RipjawsV RED 2x8GB DDR4 2400mhz (CL15)
- ZOTAC GTX1080 8GB DDR5X 256BIT Founder's Edition (1733/10000) NEW
– SuperFlower Leadex Gold 650W(80+Gold/Full Modular)*5 Years Warranty
– CORSAIR AIR 540 BLACK WINDOW
– INTEL 540s 480GB 2.5″ Sata SSD (560/480)
This is a good build for a general computation machine. It is a bit expensive for deep learning, as the performance is mostly determined by the GPU; using more GPUs and a cheaper CPU/motherboard/RAM would be better for deep learning, but I guess you also want to use the PC for something other than deep learning :). This would be a good PC for Kaggle competitions. If you plan on running very big models (like doing research), then I would recommend a GTX Titan X for memory reasons.
Thanks for all the info. If I plan to use only one GPU for computation, then would I expect to need two GPUs in my system: one for computation and another for driving a couple of displays? Or can a single GPU be used for both jobs?
A single GPU is fine for both. A monitor will use about 100-300MB of your GPU memory and usually draw an insignificant amount (<2%) of performance. It is also the easier option, so I would just recommend to use a single GPU.
Any comments on MIT’s Eyeriss chip?
http://www.rle.mit.edu/eems/wp-content/uploads/2016/02/eyeriss_isscc_2016_slides.pdf
I haven't been able to boot this MSI laptop with any of the flavors of 14.04 (Lubuntu, Xubuntu, Kubuntu, Ubuntu); could it be that the Skylake processor is not compatible with 14.04?
https://bugzilla.kernel.org/show_bug.cgi?id=109081
Looks like I will have to wait until a fix is created for the upstream Ubuntu versions or until NVIDIA updates CUDA to support 16.04. Is there anything else I can try?
Thanks!
Laptops with an NVIDIA GPU in combination with Linux are always a pain to get running properly, as it often also depends heavily on the other hardware in your laptop. I do not have any experience with this case, but you might be able to install 14.04 and then try to patch the kernel with what you need. Not easy to do, though.
Hi Tim Dettmers,
Your blog is awesome. I currently have a GeForce GTX 970 in my system; is that sufficient for getting started with convolutional neural networks?
A GTX 970 is an excellent option to explore deep learning. You will not be able to train the very largest models, but that is also not something you want to do when you are exploring. It is mostly about learning how to train small networks on common and easy problems, such as AlexNet and similar convolutional nets on MNIST, CIFAR10 and other small data sets, until you get a "feel" for training convolutional nets, so that you can then go on to larger models and larger data sets (ResNet on ImageNet, for example). So everything is good.
Hello Tim:
Thanks for the great post. I built the following PC based on it.
CPU: i5 6600
Mother board: Z170-p
DDR4: 16g
GPU: nvidia 1080 founder edition
Power: 750W
However, after installing 14.04, I can't get CUDA 8.0 and the new driver to install (which, they claim, GTX 1080 users have to use).
Could the problem be caused by the other components of the PC, like the motherboard?
Thanks!
I have heard that people have problems with Skylake under Ubuntu 14.04, but I am not sure if that is really the problem. You can try upgrading to Ubuntu 16.04 because Skylake support is better in that version, but I am not sure if that will help.
Hey,
first of all, thanks for the guide; it helped me immensely to get some clarity in this puzzle! 🙂
Couple of questions as I’m a bit too impatient to wait for 1080/70 reviews on this topic:
As you stated, bandwidth, memory clock and memory size seem to be among the most important factors, so would it even make sense to put some more money into a solidly overclocked custom GPU? So far I'll just pick the cheapest solidly cooled one (EVGA ACX 3.0 probably).
Also, my initial analysis of the 1070 GTX vs. the 1080 GTX was heavily in favor of the 1080 GTX based on the benchmarks from http://www.phoronix.com/scan.php?page=article&item=nvidia-gtx-1070&num=4. Though the theoretical TFLOPS SP MIXBENCH results were slightly in favor of the 1070 (76.6 €/TFLOP for the 1080 GTX vs. 73.9 €/TFLOP for the 1070 GTX), the SHOC on CUDA results in terms of price efficiency were slightly in favor of the 1080 GTX, but more or less the same. However, the GDDR5X on the 1080 GTX seems to seal the deal for deep learning applications, I guess? Also, I found the 1080 around 6 watts/TFLOP more cost-efficient. Am I on the right track here? Maybe the numbers help some others here searching for opinions on that :).
Anyways after reading through your articles and some others I came up with this build:
http://pcpartpicker.com/list/LxJ6hq . Some comments would be very appreciated 🙂 . I feel like the CPU is a bit overkill, but it was the cheapest with DDR4 RAM and 40 lanes. Maybe it is not needed, though I'm a bit unsure about that.
Best regards
Wonderful guide! Thank you!
When I initially left a comment I seem to have clicked on the
-Notify me when new comments are added- checkbox and now every time a comment is
added I get four emails with the exact same comment.
Is there a way you can remove me from that
service? Many thanks!
That sounds awful. I will check what is going wrong there. However, I am unable to remove a single user from the subscription. See if you can unsubscribe yourself; otherwise please contact the Jetpack team. Apparently the data is stored by them, and the plugin that I use for this blog accesses that data, as you can read here. I hope that helps you. Thanks for letting me know.
Thank you for an excellent post, I keep coming back here for reference.
With regards to memory types, what role does GDDR5 vs GDDR5X play? Is this an important differentiator between offerings like 1080 and 1070, or is it not relevant for deep learning?
Hi
The Asus spec for the X99-E WS shows that it has a PLX chip that provides an additional 48 PCIe lanes. Getting an i7-6850K with an X99-E WS theoretically gives you 88 PCIe lanes in total, and that is still plenty to run 4 GPUs all at x16.
Is that true for deep learning?
Thx for reply.
I am not exactly sure how this feature maps to the CPU and to software compatibility. From what I have heard so far, you can quite reliably access GPUs from very non-standard hardware setups, but I am not so sure whether the software would support such a feature. If the GPUs are not aware of each other on the CUDA level due to the PLX chip, then this feature will do nothing good for deep learning (it would probably be even slower than a normal board, because you would probably need to go through the CPU to communicate between GPUs).
But the idea of a PLX chip is quite interesting, so if you are able to find out more information about software compatibility, then please leave a comment here — that would not only help you and me, but also all these other people that read this blog post!
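One practical check you can run on such a board is the driver's own topology report, which shows for each GPU pair whether traffic goes through a PCIe switch (such as the PLX chip), through the CPU's host bridge, or across CPU sockets. This is just a sketch and assumes a reasonably recent driver that ships the topo subcommand.

```python
# Print the GPU interconnect topology as seen by the NVIDIA driver. In the
# matrix, PIX/PXB mean the GPUs talk through a PCIe switch (e.g. a PLX chip),
# while PHB means traffic has to go through the CPU's host bridge.
import subprocess

print(subprocess.check_output(["nvidia-smi", "topo", "-m"]).decode())
```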
In general I am looking for a cheaper way to assemble the setup without decreased performance.
Does the NVIDIA coolbits option make it possible to reduce how much the GPU heats up?
You wrote about "coolbits" on Ubuntu and the problem with headless operation.
Did you hear about DVI or VGA dummy plugs, i.e.
http://www.ebay.com/itm/Headless-server-DVI-D-EDID-1920×1200-Plug-Linux-Windows-emulator-dummy-/201087766664
I think it would be a good solution for a video card with no monitor attached, with no problems controlling coolbits.
Tim,
Based on your guide I gather that choosing a less expensive hexa-core Xeon CPU with either 28 or 40 lanes will not lead to a great drop in performance. Is that correct (for 1-2 GPUs)? Can you share your thoughts?
Great guides, very helpful for folks getting into deep learning and trying to figure out what works best for their budget.
Dante
Yes that is very true. There is basically no advantage from newer CPUs in terms of performance. The only reason really to buy a newer CPU is to have DDR4 support, which comes in handy sometimes for non-deep learning work.
Great article. What would you recommend for a laptop GPU setup rather than a desktop? I see a lot of laptop builds with a 980M or 970M GPU, but is it worth waiting for some variant of the 1080M/1070M/1060M?
A laptop with such a high-end graphics card is a huge investment, and you will probably use that laptop much longer than people use their desktops (it is much easier to sell your GPU and upgrade with a desktop). I would thus recommend waiting for the 1000M series. It seems it will arrive in a few months, and the first performance figures show that the cards are slightly faster than the GTX Titan X — that would be well worth the wait in my opinion!
Hi Tim,
Thanks for the excellent post. The user comments are also pretty informative. Kudos to all.
I recently started shifting my focus from conventional machine learning to deep learning. I work in the medical imaging domain and my application has a dataset of 50,000 color images (5,000 per class, 10 classes, size 512×512). I have a system with a Quadro K620 GPU. I want to train state-of-the-art CNN architectures like GoogLeNet Inception V3, VGGNet-16 and AlexNet from scratch. Will the Quadro K620 be sufficient for training these models? If I have to go for a higher-end GPU, can you please suggest which card I should go for (GTX 1080, Titan X, etc.)? I want to generate prototypes as fast as possible. Budget is not the primary concern.
A Quadro K620 will not be sufficient for these tasks. Even with very small batch sizes you will hit its limits pretty quickly. I recommend getting a Titan X on eBay. Medical imaging is a field with high-resolution images where any additional amount of memory can make a good difference. Your dataset is fairly small, though, and probably represents a quite difficult task; it might be good to split up the images to get more samples and thus better results (quarter them, for example, if the label information is still valid for those images), which in turn would consume more memory. A GTX Titan X should be best for you.
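A minimal sketch of the quartering idea (it assumes the class label is still valid for each quarter, which you have to verify for your data):

```python
# Split a 512x512 image into four 256x256 quarters to get more training
# samples. Only valid if the class label still applies to each quarter.
import numpy as np

def quarter(image):
    """image: array of shape (H, W, 3) with even H and W -> list of 4 crops."""
    h, w = image.shape[0] // 2, image.shape[1] // 2
    return [image[:h, :w], image[:h, w:], image[h:, :w], image[h:, w:]]

# Example with a random stand-in image:
img = np.random.randint(0, 256, (512, 512, 3)).astype(np.uint8)
crops = quarter(img)
print([c.shape for c in crops])   # four (256, 256, 3) arrays
```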
Hey there Tim,
Thanks for all the info!
I was literally pushing send on an email that said “ORDER IT” to my local computer build shop when nVidia announced the new Titan X Pascal.
Do you have any initial thoughts on the new architecture? Especially as it pertains to cooling the VRAM, which usually requires some sort of custom hardware (a cooling plate? my terminology is likely wrong here): will that add additional delay after purchasing the new hardware?
Thank you sir!
There should be no problems with cooling the GDDR5X memory with the normal card layout and fans. I know that for HBM2 NVIDIA actually designed the memory to be actively cooled, but HBM2 is stacked while GDDR5X is not. Generally, GDDR5X is very similar to GDDR5 memory. It will consume less power but also offer higher density, so on the bottom line GDDR5X should run at the same temperature level or only slightly hotter than GDDR5 memory — no extra cooling required. Extra cooling makes sense if you want to overclock the memory clock rate, but often you cannot get much more performance out of it relative to how much you need to invest in cooling solutions.
Overall the Pascal architecture seems quite solid. However, most features of the series are a bit crippled due to manufacturing bottlenecks (16nm, GDDR5X, HBM2: all of these need their own factories). You can expect the next line of Pascal GPUs to step up the game by quite a bit. The GTX 11 series will probably feature GDDR5X/HBM2 for all cards and allow full half-precision float performance. So Pascal is good, but it will become much better next year.
Cool thanks. That gave me something to chew on.
Last question (hopefully for at least a week : ) ): Do you think that a standard hybrid cooling closed-loop kit (like this one from Arctic: https://www.arctic.ac/us_en/accelero-hybrid-iii-140.html) will be sufficient for deep learning or is a custom loop the only way to go?
– VRM: heatsink + fan
– VRAM: Heatsink ONLY
– GPU: closed-loop water cooled
Obviously will have to confirm the physical fit once those specs become more available, but insofar as the approach, I was a little bit concerned about the VRAM.
The use case is convolutional networks for image and video recognition.
Thanks,
Selly
I want to build my own deep learning machine using a Skylake motherboard and CPU. I am planning not to use more than 2 GPUs (GTX 1080), starting with one GPU first and upgrading to a second one if needed.
Here is my setup on PCPartPicker: http://pcpartpicker.com/user/bmahak2005/saved/Yn9qqs
Please tell me what you think about it.
Thanks again for a great article .
HB.
The motherboard and CPU combo that you chose only supports 8x/8x speed for the PCIe slots. This means you might see some slowdown in parallel performance if you use both of your GPUs at the same time. The decrease varies between networks, with roughly 0-10% performance loss. Otherwise the build seems to be okay. Personally, I would go with a bit more wattage on the PSU just to have a safe buffer of extra watts.
hi tim,
thanks for some really useful comments. I have a hardware question. I've configured a Windows 10 machine for some GPU computing (not DL) at the moment. I think the hardware issues overlap with your blog, so here goes:
The system has a GTX 980 Ti card and a K40 card on an ASUS X99 Deluxe motherboard. When the system boots up, the 980 (which runs the display as well) is fine, but the K40 gives me "This device cannot start. (Code 10). Insufficient system resources exist to complete the API". I have the most up-to-date drivers (354.92 for the K40, 368.81 for the 980).
Has anyone configured a system like this, and did they have similar problems? Any ideas will be greatly appreciated.
It might well be that your GPU driver is meddling here. There are separate drivers for Tesla and GTX GPUs; you have the GTX variant installed and thus the Tesla card might not work properly. I am not entirely sure how to get around this problem. You might want to configure the system as a headless (no monitor) server with Tesla drivers and connect to it using a laptop (you can use Remote Desktop on Windows, but I would recommend installing Ubuntu).
How do the new NVIDIA 10xx compare? I followed through with this guide and ended up getting a GTX Titan. The bandwidth looks slightly higher for the Titan series. Does the architecture affect learning speeds?
The bandwidth is high for all Titans, but their performance differs from architecture to architecture; for example, Kepler (GTX Titan) is much slower than Maxwell (GTX Titan X) even though they have comparable bandwidth. So yes, the architecture does affect learning speed — quite significantly so!
How are you so patient with everyone’s questions ?
There are several reasons:
- I led a team of 250 in an online community and people often asked me for help and guidance. At first I sometimes lent support and sometimes I did not. However, over time I realized that not helping out can produce problems: it demotivates people from something they really want to do but do not know how to do, and it produces defects in the social environment (when I do not help out, others take example from my actions and do the same), among other things. Once I started always lending a hand, I found that I do not lose as much time as I thought I would. Due to my vast background knowledge in this online community, it was often faster to help than to think about whether some question or request was worthy of my help. I now always help without a second thought, or at least start helping until my patience grows tired
– Helping people makes me feel good
- I was born with genes which make me smart and which make me understand some things more easily than others. I feel that I have a duty to give back to those who were less fortunate in the birth lottery
– I believe everybody deserves respect. Answering questions which are easy for me to answer is a form of respect
I hope that answers your question 🙂
You are an amazingly good person Tim. The world needs more people like you. Your actions encourage others to behave in a similar way which in turn helps build better online and offline communities. Thank you!
Thank you for the kind words!
Thanks for the great guide.
I had a question. What is the minimum build that you recommend for hosting a Titan X pascal?
For a single Titan X Pascal and if you do not want to add another card later almost any build will do. The CPU does not matter; you can buy the cheapest RAM and should have at least 16 GB of it (24 GB will be more than enough). For the PSU 600 watts will do; 500 watts might be sufficient. I would buy a SSD if you want to train on large data sets or raw images that are read from disk.
Hi Tim
Thanks a lot for your useful blog.
I am training CNN on CPU and GPU as well.
Although the weights are randomly initialized, I am setting the random seed to zero at the beginning of training. Still, the weights learnt on the CPU differ from those learnt on the GPU. The difference is not huge (e.g. -0.0009 vs -0.0059, or 0.0016 vs 0.0017), but it is a difference I can notice. Do you have any idea how this could be happening? I know it is a very broad question, but what I want to ask is: is this expected or not?
I am using MatlabR2016a with MatConvNet 1.0 beta20 (Nvidia Quadro 410 GPU in Win7 and GTX1080 in Ubuntu 16.04), Corei7 4770 and Corei7 4790.
Exactly same data with same network architecture used.
Best Regards
Wajahat
This can well be true and normal. The same seed can produce different random numbers on CPU and GPU if different algorithms are used. Convolution on GPUs may also include some non-deterministic operations (cuDNN 4). When using unit tests to compare CPU and GPU computation, I also often see some difference in output given the same input, so I assume that there are also small differences in floating point computation (although very small). All this might add up to your result.
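If you want to convince yourself that tiny floating point discrepancies are normal, here is a minimal NumPy sketch (my own illustration, unrelated to MatConvNet): summing the very same 32-bit numbers in a different order already changes the result slightly, and CPU and GPU kernels accumulate in different orders.

```python
import numpy as np

np.random.seed(0)
x = np.random.randn(100000).astype(np.float32)

# float32 addition is not associative, so summing the same values
# in a different order usually gives a slightly different result.
forward_sum = np.sum(x)
reverse_sum = np.sum(x[::-1])
print(forward_sum, reverse_sum, forward_sum - reverse_sum)
```

Differences of this magnitude, accumulated over thousands of updates, can easily explain slightly different final weights.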
Hi Tim,
Many thanks for this post, and your patient responses. I had a question to ask – NVIDIA gave away Tesla K40C (which is the workstation version of K40, as I understand) as part of its Hardware Grant Program (I think they are giving TitanX now, but they were giving Tesla K40Cs until recently). It’s not clear to me what workstations from standard OEMs like Dell/HP are compatible with a K40C. I have spoken to a few vendors about compatibility issues, but I don’t seem to get convincing responses with knowledge. I am concerned about buying a workstation, which would later not be compatible with my GPU. Would it be possible for you to share any pointers you may have?
Thank you very much in advance.
The K40C should be compatible with any standard motherboard just fine. The compatibility that hardware vendors stress is often assumed for data centers where the cards run hot and need to do so permanently for many months or years. The K40C has a standard PCIe connector and that is all you need on your motherboard.
I just started learning about neural networks and I’m looking forward to studying it. I have a gt 620 with a dual core pentium g2020 clocked at 3.3 ghz with 8gb of ram. Would it be better to buy a 1060 and two 8gb rams for the future?
Yes, the GT 620 does not support cuDNN, which is important deep learning software that makes deep learning more convenient because it allows you more freedom in choosing your deep learning framework. You will have fewer troubles if you buy a GTX 1060. 16GB of RAM will be more than enough; I think even 8GB could be okay. Your CPU will be sufficient, no upgrade required.
Hi,
I just bought two Supermicro 7048GR-TR server machines with 4 Titan X cards in each machine. I'm confused about how to configure the servers: how many partitions do I have to make, and how do I best use the 256GB SSD and the two 4TB hard drives in each machine? The servers will only be used for deep learning applications. Which deep learning framework should I use (TensorFlow, Caffe, or Torch) considering the two servers? I work in the medical imaging domain and recently started getting into deep learning. Please help me with your valuable suggestions.
Link for server configuration:
https://www.supermicro.com.tw/products/system/4u/7048/SYS-7048GR-TR.cfm
Thanks and Regards
sk06
The servers have a slow interconnect, that is, they only have gigabit Ethernet, which is a bit too slow for parallelism across machines. So you can focus on setting up each server separately. It depends on your dataset size, but you might want to dedicate the SSD to your datasets, that is, install the OS on the hard drive. If your datasets are < 200GB, you could also install the OS on the SSD to have a smoother user experience. The frameworks all have their pros and cons. In general I would recommend TensorFlow, since it has the fastest growing community.
Thanks for the suggestions. I tried training my application with 4 GPUs on the new server. To my shock, training AlexNet took 2.30 hours with 4 GPUs, while it took 35 minutes with a single GPU. I used Caffe for this. Please let me know where I am going wrong! The batch size and other parameter settings are the same as in the original paper.
Thanks and Regards
sk06
First of all, really nice blog and well made articles.
Do you think spending £240 more for a 1070 (2048 CUDA cores) instead of a 1060 (1280 CUDA cores) in a laptop is worth it? Does the complexity of the most used deep learning algorithms require the extra 768 CUDA cores?
Thank you.
I am not sure how easy it is to upgrade the GPU in the laptop. If it is difficult, this might be one reason to go with the better GPU since you will probably also have it for many years. If it is easy to change, then there is not really a right/wrong choice. It all comes down to preference, what you want to do and how much money you have for your hardware and for your future hardware.
Hi Tim,
I had a question about the new pascal gpu’s. I am debating between Gtx 1080 and Titan X. The price of Titan X is almost double the 1080’s. Excluding the fact that Titan X has 4 more Gb memory, does it provide significant speed improvement over 1080 to justify the price difference?
Thanks,
Hi,
I am not Tim (obviously), but as far as I understood from his other post on GPUs (http://timdettmers.com/2014/08/14/which-gpu-for-deep-learning/), it does make a difference for research-level work, especially when you are working with video datasets. But for example: “While 12GB of memory are essential for state-of-the-art results on ImageNet, on a similar dataset with 112x112x3 dimensions we might get state-of-the-art results with just 4-6GB of memory.”
Hope this can help you.
If you can afford it, the TITAN X is DEFINITELY worth it over the 1080 in most cases. Not only does it have 12GB of VRAM to work with, it also has features like INT8 (the way I understand it, you can store floats as 8-bit integers, which helps efficiency; potentially quite useful) and 44 TOP units (kind of like ROPs but not for graphics rendering; they are beneficial to deep learning though).
Basically the TITAN X is nearly identical to the $7000 Tesla P100, just without the double-precision FP64 capability and without HBM2 memory. (The TITAN X uses GDDR5X instead; however it is not much of a difference, as the P100's memory bandwidth even with HBM is only 540 GB/s, whereas the TITAN X is very close at 480 GB/s and hits 530 GB/s when you overclock the memory from 10,000MHz to 11,000MHz.) Other than those things and the certified Tesla drivers there is no real difference between the P100 and the TITAN X Pascal, which is notable, as the Tesla P100 is about the most powerful graphics card on the planet right now!
The important thing to mention is that double precision isn't really important for the neural nets you deal with in deep learning; so for $1,200 you are getting much of the power of the $7,000 Tesla P100 without all the server features that deep learning doesn't use.
Also, in comparison to the GTX 1080, the TITAN X has a significant advantage in both memory capacity (12GB vs 8GB on the 1080) and memory bandwidth (530 GB/s when overclocked on the TITAN X vs 350 GB/s on the 1080 when overclocked, roughly a fifty percent increase), and it has a large increase in CUDA cores (40% more), which combined with the doubled memory capacity and higher bandwidth easily nets you ~60% more performance in some scenarios over the 1080.
Hope this helps; the TITAN X is a GREAT chip for deep learning, the best currently available in my opinion, which is why I bought two of them.
(sorry for the long post but it is important to your decision so try to read it all if you have time)
Hey, correcting an error in my earlier post. Like I said, I wasn't quite sure if I understood the INT8 functionality properly, and I was wrong about it. Apparently there was a typo on the spec page of the Pascal TITAN X: it said “44 TOPs” and made me think it was an operation pipeline of sorts, similar to a “ROP”, which is responsible for displaying graphical images.
It was actually referring to INT8, which is basically just 8-bit integer support. The average GPU runs with 32-bit “full precision” accuracy, which is a measure of how much time and effort is put into each calculation made by the GPU. For example, with 32 bits it may only carry a limited number of decimal places when calculating, say, the physics of water in a 3D render, which is plenty good for things like video games and your average video editing and rendering project. But for things like advanced physics simulations at big universities, which try to determine the accurate behavior of each individual molecule of H2O within a body of water to see exactly how it moves when wind blows, you would need “double precision”, a 64-bit calculation that carries much more accuracy, going to more decimal places before deciding that the calculation is “close enough” than 32-bit would.
Only special cards like Quadros and Teslas have high FP64 performance; they usually have half the teraflops at 64-bit compared to 32-bit, so a Quadro P6000 (same GPU as the TITAN X Pascal but with full 64-bit support) has 12 teraflops at 32-bit and ~6 teraflops at 64-bit. But there is also a 16-bit “half precision” mode for things requiring even less accuracy, and INT8, to my understanding, is basically an 8-bit “quarter precision” mode with even less focus on total mathematical accuracy; this is useful for deep learning, as some of the work done doesn't require that much accuracy.
So, in other words, in 8-bit mode the TITAN X delivers “44 TOPs” (tera-operations per second).
Your analysis is largely correct. For some games there are already elements which make heavy use of 8-bit integers. However, before, it was not possible to do 8-bit integer computation directly: you had to first convert both numbers to 32-bit, do the computation, and then convert back. This was done implicitly by the GPU so that no extra programming was necessary. Now the GPU is able to do it natively. However, the support is still quite limited, so you will not see 8-bit deep learning just yet. Probably in a year at the earliest would be my guess, but I am sure it will arrive at some point.
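To make the idea concrete, here is a minimal NumPy sketch (my own illustration, not actual cuDNN INT8 code) of storing 32-bit weights as 8-bit integers plus a single scale factor, which is roughly what 8-bit inference schemes do:

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)

# Map the float range symmetrically onto the int8 range [-127, 127].
scale = np.abs(weights).max() / 127.0
w_int8 = np.round(weights / scale).astype(np.int8)   # stored: 1 byte per weight
w_restored = w_int8.astype(np.float32) * scale       # dequantized when needed

print(np.abs(weights - w_restored).max())  # maximum quantization error
```

The storage and bandwidth savings are 4x compared to 32-bit floats; the price is the small quantization error printed at the end.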
Hi Tim,
First of all, thank you for sharing all this precious information.
I am new to neural networks and Python.
I want to test some ideas on financial time series.
I'm starting to learn Python, Theano, and Keras.
After reading your article, I decided to upgrade my old pc.
I know almost nothing about hardware so I ask you an opinion about it.
Current configuration:
– Motherboard: Gigabyte GA-P55A-UD3 (specification at: http://www.gigabyte.com/products/product-page.aspx?pid=3439#sp)
– Intel i5 2.93 GHz
– 8 Gb Ram
– GTX 980
– PSU power: 550watts
I may add:
– SSD (I will install Ubuntu on it and use it only via the command line, no graphical interface)
Is the power supply powerful enough for the new card?
Does the motherboard support the new card?
Thank you very much,
Gilberto
The motherboard should work, but it will be a bit slower. The PSU is borderline; it might be a bit too few watts or just right, it is hard to tell.
Hi Tim
Is this a good one for a deep learning researcher?
https://www.bhphotovideo.com/c/product/1269213-REG/asus_g20cb_db71_gtx1070_republic_of_gamers_g20cb.html
thank you!
It is a bit pricey and there are not many details about the motherboard. Also, the GPU might be a bit weak for researchers.
I would also encourage you to buy the components and assemble them on your own. This may seem like a daunting task, but it is much easier than it seems. This way you get a high-quality machine that is cheap at the same time.
I am using an Asus K55VJ, 3rd-gen i5, NVIDIA GeForce GT 635M 2GB, with a 750GB HDD and 8GB RAM. Does my computer support deep learning?
Your GPU has compute capability of 2.1 and you need at least 3.0 for most libraries — so no, your computer does not support deep learning on GPUs. You could still run deep learning code on the CPU, but it would be quite slow.
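If you want to check the compute capability of whatever card you have (or are about to buy) yourself, here is a small hedged sketch using PyTorch, assuming you have a CUDA-enabled PyTorch build installed:

```python
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    print("%s has compute capability %d.%d" % (name, major, minor))
else:
    print("No usable CUDA GPU found")
```

The deviceQuery sample that ships with the CUDA toolkit reports the same information without any Python involved.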
Awesome! Thanks for sharing. Can you tell me how much it would cost to build such a cluster? Cheers!
Basically it is two regular deep learning systems together with infiniband cards. You can get infiniband card and a cable quite cheap on eBay and the total cost for a 6 GPU, 2 node system would be about 3k for the system and infiniband cards, and an additional 6k for the GPUs (if you use Pascal GTX Titan X) for a total of $9k.
My current CPU is an Intel Core i3 2100 @ 3.1GHz and I have 4GB of RAM. My motherboard is a Gigabyte GA-H61M-S2P-B3 (rev. 1.0), which supports PCIe 2.0. Can I use a GTX 1060 with my current configuration or do I need to change the board and the CPU? I want to keep the cost as low as possible.
You should be able to run a GTX 1060 just fine. The performance should be only 5-10% less than on an optimal system.
Hi Tim,
I just got 5 Dell Precision T7500s in an auction.
I haven't received them yet, but the description mentions an NVIDIA Quadro 5000 installed.
Would it be worth replacing them, or are they enough for starting out?
The machines themselves have 12GB of DDR3 RAM (ECC, I presume) and a Xeon 5606, as described.
The Quadro 5000 has only a compute capability of 2.0 and thus will not work with most deep learning libraries that use cuDNN. Thus it might be better to upgrade.
Thanks.
I am thinking of going with GTX 1060.
Is there any difference though between the EVGA, ASUS, MSI or NVIDIA versions?
These are the options I see when I search on eBay.
That should not matter much. Don't go with the NVIDIA Founders Edition; it doesn't have a good cooling system. Just go with the cheapest one, which is EVGA; it is one of the most reputable brands. I just ordered the EVGA one.
Please note that the GTX 1080 EVGA currently has cooling problems which are only fixed by flashing the BIOS of the GPU. Without this BIOS update the card may overheat to the point of damage.
Hi Tim,
Could you recommend any Mellanox ConnectX-2 cards for GPU RDMA?
Some are Ethernet-only (the MNPA19-XTR, for example) and I wonder if those can be flashed to support RDMA, or whether I should just buy a card which supports InfiniBand outright?
Hi Tim
Thanks for the great article and your patience to answer all the questions. I just built a dev box with 4 Titan X Pascal and need some advice on air flow. For reference, here is the Part list: https://pcpartpicker.com/list/W2PzvV and the Picture: http://imgur.com/bGoGVXu
I loaded Windows first to stress-test the components and noticed the GPU temps reached 84C while the fans were still at 50%. Then the GPUs started throttling to lower/maintain the temperature. With MSI Afterburner I could specify a custom temp-vs-fan-speed profile and keep the GPU temps at 77C or below, pretty much what you wrote in the cooling section above.
There is no “Afterburner” for Linux, and apparently the BIOS of the Titan X Pascal is locked, so we can't flash them with custom temperature settings. The only option left for me is to play with coolbits, and I prefer not to attach 4 monitors to it (I already have two 30-inch monitors attached to a Windows computer that I use for everything; 6 monitors on the table would be too much).
I wonder if you found any new way of emulating monitors for Xorg, as my preferred option would be to keep 3 of the GPUs headless?
Cheers
Ashiq
I did not succeed in emulating monitors myself. Some others claim that they got it working. I think the easiest way to increase the fan speed would be to flash the GPU with a custom BIOS. That way it will work in both Windows and Linux.
Not sure, but there may exist specific dummy plugs to help “emulate” monitors if it is not possible purely in software. At least DVI and HDMI dummy plugs worked for cryptocurrency miners back in the day.
So I got it (virtual screens with coolbits) working by following the clues from http://goo.gl/FvkGC7. Here (https://goo.gl/kE3Bcs) is my X server config file (/etc/X11/xorg.conf), and I can change all 4 fan speeds with nvidia-settings.
Thanks Ashiq — that sounds great! Thank you for sharing the link!
Hey, I wanted to ask if the NVIDIA Quadro K4000 would be a good choice for running convolutional nets?
A K4000 will work, but it will be slow and you cannot run big models on large datasets such as ImageNet.
Shall I get a GTX1080 instead?
Great hardware guide. Thank you for sharing your knowledge.
This is a good overview of the hardware that matters for DL. I would like your view on the OpenPOWER + NVIDIA combo and the economics of setting up an ML/DL lab.
I think that non-consumer hardware is not economically efficient for setting up an ML/DL lab. However, beyond a certain number of GPUs, traditional consumer hardware is no longer an option (NVIDIA will not sell you consumer-grade GPUs in bulk and there might also be problems with reliability). I would recommend getting as much traditional, cheap, consumer hardware as possible and mixing it with some HPC components like cheap Mellanox InfiniBand cards and switches from eBay.
Hi Tim,
Thank you for sharing your knowledge; it was very beneficial for understanding the concepts in DL.
I have a doubt:
How do I feed custom images into a CNN for object recognition using Python? Please give me some pointers on this.
You will need to rescale custom images to a specific size so that you can feed your data into a CNN. I recommend looking at ImageNet examples of common libraries (Torch7, Tensorflow) to understand the data loading process. You will then need to write an extension which resizes your images to the proper dimension, for example 1080×1920 -> 224×224.
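As a rough illustration of that resizing step, here is a minimal sketch using Pillow and NumPy (the file path, target size, and helper name are just placeholders, not part of any particular framework):

```python
import numpy as np
from PIL import Image

def load_image(path, size=(224, 224)):
    # Resize a raw image (e.g. 1080x1920) down to the network input size
    # and return a float32 array of shape (height, width, 3) scaled to [0, 1].
    img = Image.open(path).convert("RGB").resize(size, Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 255.0

# batch = np.stack([load_image(p) for p in image_paths])  # image_paths is your own list of files
```

The ImageNet data loaders in Torch7 or TensorFlow do essentially the same thing, just with asynchronous prefetching on top.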
Firstly, I am very thankful for your post. It is very nice and very helpful.
One thing I wanted to point out is that you can feed the images into the network (in Caffe) as they are. I mean, if you have a 1080×1920 image, there is no need to reshape it to 224×224. But this does not mean that feeding the image as-is performs better; I think this could be a standalone research topic 🙂
Secondly, I am planning to buy a desktop PC, and since I am a deep learning researcher (beginner) I am going to do a lot of experiments on ImageNet and other large-scale datasets. Do you suggest buying a gaming PC directly, or would it be a wiser choice to build my own PC?
I was considering to buy Asus ROG G20CB P1070.
Thank you very much in advance!
Regards,
Building your own PC would be the better choice in the long term. It can be daunting at first, but it is often easier than assembling IKEA furniture, and unlike IKEA furniture there is a multitude of resources on how to do it step by step. After you have built your first desktop, building the next ones will be easy and rewarding, and you will save a lot of money to boot!
Thank you Tim!
Great article, and thanks for sharing!
I want to configure my working layout like yours: “Typical monitor layout when I do deep learning: Left: Papers, Google searches, gmail, stackoverflow; middle: Code; right: Output windows, R, folders, systems monitors, GPU monitors, to-do list, and other small applications.”
Do I need any extra configuration beyond connecting the 3 monitors? Is any additional hardware needed for a 3-monitor configuration?
Thanks!
No extra configuration is required other than the normal monitor configuration for your operating system. Your GPU needs to have enough connectors and support 3 monitors (most modern GPUs do).
Hi Tim ,
Wonderful article. However, I am about to buy a new laptop. What do you think about the idea of a gaming laptop for deep learning with an NVIDIA GTX 980M or GTX 1060/1070?
Definitely go for the GTX 10 series GPUs for your laptop, since these are very similar to full desktop GPUs. They are probably more expensive though. Another option would be to buy a cheap, light laptop with long battery life and a separate desktop to which you connect remotely to run your deep learning work. The last option is what I use and I am quite fond of it.
I am very happy that I thought as you did. I bought Macbook Air which is very portable, and going to buy a desktop with better specifications to do my experiments on it.
I had a question but I have asked it in previous comment.
Thank you again for the very useful information.
Regards,
You are welcome! I am glad I could help out!
Tim
As a website owner myself, I believe the content here is really excellent; I appreciate your hard work. You should keep it up forever! Best of luck.
Thank you! I aim to keep it up forever 🙂
I am confused between two options:
1) A 2nd Generation core i5, 8GB DDR3 RAM and a GTX 960 for $350.
2) A 6th Generation core i3, 16GB DDR3 RAM and a GTX 750Ti for $480.
Can you please comment? I expect to upgrade my GPU after a few months.
A difficult choice. If you upgrade your GPU in a few months, then it depends on whether you use your desktop only for deep learning or also for other tasks. If you use your machine regularly, I would spend the extra money and go for option (2). If you want to do almost exclusively deep learning with the machine, (1) is a good, cheap choice. The choice also depends on whether you buy the 2GB or 4GB variant of each GPU. In terms of speed, (1) will be about 33-50% faster, but speed is not too important when you start out with deep learning, especially if you upgrade the GPU eventually.
Thank you Tim, you really inspire me! Actually I took the Udacity SDCND course, and here is the list of a few projects I want to accomplish on a local machine:
1. Road Lane-Finding Using Cameras (OpenCV)
2. Traffic Sign Classification (Deep Learning)
3. Behavioral Cloning
4. Advanced Lane-Finding (OpenCV)
5. Vehicle Tracking Project (Machine Learning and Vision)
So, my work is solely related to computer vision and deep learning. I also have the option of a GTX 1060 6GB with that Core i3 (option 2). Of course, I expect to code the GPU versions of the OpenCV tasks. Do you think this third option would be sufficient to accomplish these projects in an average amount of time? Thank you again.
Hi Shahid. I’m in the same boat as yours. Even I have signed up for the SDCND. I have an old PC with core i3 and 2GB RAM. I am adding additional 8GB RAM and buying GTX 1060 6GB. This is a really powerful GPU which’ll perform great in our work associated with the SDCND.
Hi Tim,
Thank you for this excellent guide.
I was wondering, now that the new 1000 series and Titan X came out, what are your updated suggestions for GPUs (no money, best performance, etc)?
Please, see my GPU blog post for these updates.
Thank you, Tim. I ended up buying a GTX 1070.
Now, I have to purchase the MOBO. I’m deciding between a GIGABYTE GA-X99P-SLI and a Supermicro C7X99-OCE-F.
Both support 4 GPUs but it seems that there is not enough space for a 4th GPU on the Supermicro. Any experience with these MOBOs?
This is my draft https://pcpartpicker.com/list/6tq8bj
Indeed, the Supermicro motherboard will not be able to hold a 4th GPU. I also have a Gigabyte motherboard (although a different one) and it worked well with 4 GPUs (while I had problems with an ASUS one), but I think in general most motherboards will work just fine. So seems like a good choice.
Hi Tim,
I am willing to buy a full hardware setup for deep learning;
my budget is about $15,000.
I don't have any experience with this, and when I tried to check things out it was too complicated for me to understand.
Can you help me? Maybe recommend companies or anything else that fits my budget and is still good enough to work with?
Thanks a lot
If I were you I would put together a PC on pcpartpicker.com with 4 GPUs and then assemble it myself. This is the cheapest option. If that is too difficult, then I would look for companies that sell deep learning desktops. They basically sell the same hardware, but at a higher price.
Thank you very much for writing this! – knowing something about how to evaluate the hardware is something I have been struggling to get my head around.
I have been playing with TensorFlow on the CPU of a pretty nice laptop (a fast i7 with lots of RAM and an SSD, but ultimately dual core, so slow as hell).
I want to try something on the GPU to see if it really is hundreds of times faster, but I am worried about investing too much too soon, as I have not had a desktop in ages. Having read this post and the comments, I have the following plan:
Use an existing freenas server I have as a test bed and buy a relatively low end GPU – GTX 960 4096MB:
https://www.overclockers.co.uk/msi-geforce-gtx-960-4096mb-gddr5-pci-express-graphics-card-gtx-960-4gd5t-oc-gx-319-ms.html
The FreeNAS box has a crappy dual-core Celeron and only 8GB of RAM:
http://ark.intel.com/products/53418/Intel-Celeron-Processor-G550-2M-Cache-2_60-GHz
I will buy the graphics card and an SSD to install an alternative OS on; I *may* upgrade the RAM and processor too, as all of these items will benefit the FreeNAS box anyway (I also run Plex on it).
If this goes well and I develop further, I will look at a whole new setup later with an appropriate motherboard, CPU, etc., but in the meantime I can learn how to identify where my specific bottlenecks are likely to be.
From what you have said here I think there will be several slow parts in my system, but I am probably going to get 80-90% of the speed of the graphics card, the main restriction being that the CPU only supports PCIe 2.0; everything else, while not ideal or scalable, can probably feed that GPU fast enough.
I have 2 questions (if you have time – sorry for long comment but i wanted to make my situation clear):
1. Do you see anything drastically wrong with this approach? No guarantees obviously; I could spend more money now if I am just shooting myself in the foot, but I would rather save it for the next system once I am fully committed and have more experience.
2. I chose the GPU based on RAM, number of CUDA cores and NVIDIA's compute capability rating (which reminds me of the Windows performance rating 😀, a bit vague but better than nothing). The other one I was considering was this, £13 more, so also a fine price imho:
https://www.overclockers.co.uk/palit-geforce-gtx-1050ti-stormx-4096mb-pci-express-gddr5-graphics-card-gx-03t-pl.html
It has fewer cores (768 vs 1024) but a smaller process node and a higher clock speed (1290MHz vs 1178MHz), and I *think* a higher rating, assuming that the Ti (which seems to mean unlocked) is just better: 6.1 vs 5.2.
https://developer.nvidia.com/cuda-gpus#collapse4
Basically, is the drop in cores really made up for to such a drastic extent that this significantly higher rating from NVIDIA is accurate? Noting that I am probably going to be happy enough either way; feel free to just say “either is probably fine” 😀
Alternatively, is there something else in the sub-£150 range that you would suggest, given that the whole thing may be replaced by a Titan X or similar (hopefully cheaper after Christmas 😉) if this goes well? I did consider just getting something like this, with much less RAM but still more than 2 cores, which would let me figure out how to get code running on the GPU:
https://www.overclockers.co.uk/asus-geforce-gt-710-silent-1024mb-gddr3-pci-express-graphics-card-gx-396-as.html
Got the 1050 Ti (well, another variation of it); I figured they would be similar regardless, so I might as well trust NVIDIA's rating.
https://www.amazon.co.uk/gp/product/B01M66IJ55/
Also got 32GB of RAM and a quad-core i5 that supports PCIe 3.0, as they were all cheap on eBay (an SSD too, of course).
Looks like I can mount my ZFS pool in Ubuntu, so I will probably just take FreeNAS offline for a while and use this as a file and Plex server too (very few users anyway), and this way my RAID array will be local should I want to use it.
That sounds solid. With that you should easily get started with deep learning. The setup sounds good if you want to try out some deep learning on Kaggle.com for example.
Upgrading the system bit by bit may make sense. Note that the CPU and RAM will make no difference to deep learning performance, but might be interesting for other applications. If you only use one GPU, PCIe 2.0 will be fine and will not hurt performance. The GTX 960 and GTX 1050 Ti are on a par in terms of performance, so pick whichever is more convenient or cheaper for you.
Hi Tim,
You are such an amazing person. So patient and knowledgeable.
I am also in the same deep learning boat and willing to learn.
I bought this computer:
http://www.costco.com/CyberpowerPC-SLC2400C-Desktop—Intel-Core-i7—8GB-NVIDIA-GeForce-GTX-1080-Graphics—Windows-10-Professional.product.100296640.html
CyberpowerPC SLC2400C Desktop – Intel Core i7 – 8GB NVIDIA GeForce GTX 1080 Graphics – Windows 10 Professional
This is gaming PC but i don’t play games.
My question is: can I use a Titan X Pascal from NVIDIA along with the GeForce GTX 1080 for more computational power?
I learned that SLI is not the solution, and anyway they are different GPUs.
So, in order to get faster results, can I combine both GPUs for TensorFlow?
I am using tensorflow –
I just found this – (Basic Multi GPU Computation in TensorFlow)
https://tensorhub.com/donnemartin/4_multi_gpu
I need to install a VM with Ubuntu 16 for all this setup.
Thanks
Hi Om,
I am really glad that you found the resources of my website useful — thank you for your kind words!
The thing with the NVIDIA Titan X (Pascal) and the GTX 1080 is that they use different chips which cannot communicate in parallel. So you would be unable to parallelize a model on these two GPUs. However, you would be able to run different models on each GPU, or you could get another GTX 1080 and parallelize on those GPUs.
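If you do go the route of running a separate model on each card, one simple, framework-agnostic way to do it is to pin each training process to one GPU via the CUDA_VISIBLE_DEVICES environment variable; a minimal sketch (the device indices are assumptions and depend on how your system enumerates the cards):

```python
import os

# Pin this process to one GPU only; a second process started with "1"
# instead of "0" would see only the other card. This must be set before
# the deep learning framework initializes CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf  # the framework now sees a single GPU
```

Each process then trains its own model independently, which sidesteps the parallelization issue entirely.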
Note that using a Ubuntu VM can cause some problems with GPU support. The last time I checked it was hardly possible to get GPU acceleration running through a VM, but things might have changed since then. So I urge you to check if this is possible first before you go along this route.
Best,
Tim
Hi Tim, thanks for the excellent posts, and keep up the good work.
I am just beginning to experiment with deep learning and I’m interested in generative models like RNNs (probably models like LSTMs, I think). I can’t spend more than $2k (maybe up to $2.3k), so I think I will have to go with a 16-lane CPU. Then I have a choice of either a single Titan X Pascal or two 1080s. (Alternatively, I could buy a 40-lane CPU, preserving upgradability, but then I could only buy a single 1080). Do you have any advice specific to RNNs in this situation? Is model parallelism a viable option for RNNs in general and LSTMs in particular?
Thank you!
I think you can apply 75% of state-of-the-art LSTM models to different tasks with a GTX 1080; for the other 25% you can often create a “smarter” architecture which uses less memory and achieves comparable results. So I think you should go for 16 lanes and two GTX 1080s. Make sure your CPU supports two GPUs in an 8x/8x setting.
Should i buy a GTX 1080 now or wait the ti which is supposedly coming out next month?
The GTX 1080 Ti will be better in every way. Make sure to preorder it, however, otherwise all cards might be bought up quickly and you will have to fall back to the GTX 1080. Another strategy would be to wait a bit longer for the GTX 1080 Ti to arrive and then buy a cheap GTX 1080 from eBay. I think these two choices make sense if you can wait for a month or two.
Hi,
What do you think of the following build ?
https://pcpartpicker.com/list/8sv2jc
Thank you
Thank you for your reply.
I appreciate it.
I would personally recommend a couple of small changes. Here's what I would go with: https://pcpartpicker.com/list/bx2ssJ
First off, if you are going to spend $85 on a 256GB regular SATA SSD for storage, then you might as well get the top-of-the-line M.2 960 Evo for $120. It is over 3 times faster in transfer speeds than the one you picked and is overall much better. (Alternatively, if you don't care about the extra speed, you can get a 500GB SATA drive for about the same price, doubling your storage.)
The second thing I would change is to get a Z270 motherboard rather than a Z170. It's been a month or so since you commented, so I am not sure if you bought yours yet, but the new Z270 motherboards support more PCIe lanes, support 4K encoding on 7000-series CPUs, etc., so they're worth looking at, especially since they're basically the same price. My link swaps in a Gigabyte Z270 Gaming K3 for your Gigabyte Z170 Gaming M3; very similar boards.
Lastly, you should also get the i5 7600K instead of the i5 6600K, since Kaby Lake 7000-series processors are about 5-10% faster than Skylake 6000-series processors, and the 7600K can be overclocked to over 5GHz without problems, compared to the 6600K, which has trouble getting over ~4.7GHz on air cooling in some cases. And since the 7600K is also about the same price, you might as well get it. Personally, though, I would still recommend an i7 over an i5 in this situation, simply because simultaneous multi-threading is becoming more important as of late, and the extra 2MB of L3 cache is also nice to have. I figure if you are spending $1200 on a TITAN X Pascal, you should be able to fit in $100 more for an i7 7700K, which can also be overclocked to 5GHz pretty easily in most cases (even on air!).
Hi Tim,
I’ve just upgraded from GTX 960 to GTX 1070.
I used to run the tensorflow cifar10_multi_gpu_train.py file to check the speed from one release to another. With the last tensorflow release it peak at about 1500 images / sec with GTX 960 (which is an impressive progress btw, with the initial releases it was more like 750 images / sec).
I was surprised to see that my GTX 1070 peaks at ~1700 images/sec, a very small improvement. It looks like the CPU is now the bottleneck (I see it constantly at 300% usage, i.e. 3 full cores). I have an i5-3570K, which should be decent.
I haven't analysed it further (yet), but could somebody share their experience with this? I wasn't expecting the CPU to be the bottleneck here.
Are you training on multiple GPUs (cifar10_multi_gpu_train.py)? If so, then this is your answer. TensorFlow has terrible performance for multiple GPUs and upgrading multiple GPUs will not yield much better performance for TensorFlow.
Hi,
What do you think of the following build ?
https://pcpartpicker.com/list/8sv2jc
Thank you
Tim, thanks for the great article! I have a couple of questions:
1. What is “4” in your mini-batch size calculation (4x128x244x244x3)?
2. I’m deciding on which SSD to buy for my machine with four Pascal Titan X cards, mostly to do training on Imagenet. Assuming your bandwidth estimate of 290MBps is for a single card, should I multiply it by four when running a model on all four cards? Do you know how fast Pascal Titan X processes a single 128 mini-batch? Also, if I use mini-batch of 256 , I would need double the bandwidth, right?
Given the above considerations, would you recommend going with a PCIe based SSD, such as Samsung 960 Pro, rather than SATA based one, such as Samsung 850 Evo?
1. The data used in deep learning is usually 32-bit or 4 bytes; this is the 4 in the calculation above (conversion into bytes).
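As a quick sanity check of that arithmetic, multiplying everything out (with the numbers quoted above) gives the size of one mini-batch in bytes:

```python
# 4 bytes per float32 * batch size * height * width * color channels
bytes_per_batch = 4 * 128 * 244 * 244 * 3
print(bytes_per_batch / 1024**2)  # roughly 87 MB per mini-batch
```

Dividing that by the time your GPU needs to process one batch gives the sustained read bandwidth your storage has to deliver.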
2. This is a bit complicated. Parallelism does not scale linearly, so you should multiply the estimate by about 3.5 (for TensorFlow this will be closer to 2.5-3). One thing to keep in mind is that in practice small data transfers are often slower (the overhead is large when the data size is small) and that GPUs operate more efficiently on larger batch sizes.
I am unfamiliar with the exact internals of the TensorFlow batching procedure. If they do it right for both data loading and data transfer, a PCIe SSD would yield a bit of extra performance. However, from some benchmarks it seems that TensorFlow is sub-optimal in some parts (GPU transfers, I think). If that is really so for GPU transfers, then a PCIe SSD and a normal SSD would yield the same performance. I personally would just go for a cheap normal SSD.
Thanks Tim. A different question: which software framework would you use for experimenting with Imagenet?
So far I’ve been using Theano, but only on small datasets (MNIST and CIFAR). My main interest is to test different quantization methods for weights and activations, and see how it works for different network architectures. I’ve read your paper, by the way, very interesting, but I prefer not to code everything from scratch in C/CUDA if possible. Right now I’m looking into implementation of the asynchronous batch allocation, like you suggested, in Theano, and it’s not very straightforward.
Would you recommend switching to TensorFlow, or sticking with Theano? I’m less concerned with the ability to parallelize code across multiple GPU, because I can just run different experiments in parallel.
TensorFlow is a good call. If you want to work on vision only, Caffe is also an excellent option. However, overall PyTorch or Torch might be more suitable for you. PyTorch already implements asynchronous batching by default and Torch already has the 1-bit quantization method. I am currently not sure how well that is integrated into PyTorch, but since Torch and PyTorch are wrappers in Lua and Python, respectively, interfacing with 1-bit quantization should be relatively straightforward. If you want to implement other methods of quantization, then Torch, and to some degree PyTorch, offer good interfacing and easy extension. However, the algorithms would need to be written in C/CUDA. Extending TensorFlow in this way might not be as straightforward, so you might run into difficulties either way. TensorFlow is of course still more popular, and thus if you extend it, it will have more value for other people. So not an easy decision, but maybe these points make the decision a bit easier.
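If you do end up trying PyTorch, asynchronous batch loading mostly comes down to giving the DataLoader background workers and pinned memory; here is a minimal hedged sketch (the random tensors stand in for your own dataset, and it assumes a CUDA GPU is available):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in data; replace with your own Dataset implementation.
images = torch.randn(1024, 3, 32, 32)
labels = torch.randint(0, 10, (1024,))
loader = DataLoader(TensorDataset(images, labels), batch_size=128, shuffle=True,
                    num_workers=2,    # batches are prepared in background processes
                    pin_memory=True)  # allows faster, asynchronous host-to-GPU copies

for x, y in loader:
    x = x.cuda(non_blocking=True)  # overlaps the copy with GPU computation
    y = y.cuda(non_blocking=True)
    # forward/backward pass would go here
```

The same idea, done manually with a background thread, is what you would have to implement yourself in Theano.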
So a single Titan Pascal trumps dual gtx 1080 in sli ?
Correct ?
Hi Tim,
Very good article! Thank you!
P.S. You have cool working place.
Hi Tim,
Could you comment on below build?
– Chassis: Corsair Carbide Air 540
– Motherboard: Asus ROG STRIX X99 GAMING ATX LGA2011-3 Motherboard
– Cpu: Intel Core i7 6800k
– Ram: 32GB DDR4 G.Skill 2400Mhz
– Gpu: 1 ASUS GTX 1080
– HD1: 500GB SSD Samsung EVO
– HD2: 1TB WD Red in RAID 5
I am not sure if the board is a good choice if I might be adding a second GPU in the future. Or maybe ASUS X99-Deluxe II is worth the extra cost?
Hi Usher,
I do not have time to check the details, but it seems that the motherboard is okay. The review on newegg are not that good though, but the cost/performance might still be good. Adding a second GPU will definitely no problem with the motherboard that you chose.
Otherwise the build looks okay. I recommend checking the build with pcpartpicker, which often finds compatibility issues if there are any.
Thanks Tim for your comment!
Hi Tim,
I have the latest mac and I want to use GPU – GTX Titan X with https://www.akitio.com/expansion/node
AKiTiO Node – eGPU box Thunderbolt 3
My question is, can I use TensorFlow with this external GPU device, without killing performance and efficiency. What could be the side effect?
I know using TitanX with the desktop will be a lot better but I need mobility.
Thanks and you rock 🙂
om
On the akitio specs it says that Mac is not supported.
Just wanted to add – AKiTiO reply –
Re: tensorflow
Message: Hi Trulia, We currently do not have an eGPU solution for the Mac and the only eGPU solution we have is for select Thunderbolt 3 PCs, so the answer to your question is No. Having said that, it might be possible that you could make it work but this would be more of a DIY project that requires hardware and software modification. Also, it would void the warranty, so this is not something that I can recommend. Regards, Stefan
Hello Tim and thank you for your post. I have currently a desktop with Core 2 Quad Q9300. I was wondering whether it would bottleneck a GTX1060 6GB for some beginner to mid DL problems?
It is an old CPU, but you should be relatively fine. You can expect to run about 10-20% slower than with a high-end CPU. Processing non-deep-learning code, that is, preprocessing data, will probably take quite a bit longer, but running the deep learning model itself should not be much slower.
Thank you for your reply. I also have an old motherboard GA-P43-ES3G (http://www.gigabyte.com/Motherboard/GA-P43-ES3G-rev-10#sp) which only supports PCI Express 2.0. I believe that will be a major bottleneck right?
Hi Tim,
I have been looking into using NVLink to couple two TXPs. I was hoping to do this in a SLI-like fashion (like shown here: http://www.kitguru.net/components/graphic-cards/anton-shilov/nvidia-pascal-architectures-nvlink-to-enable-8-way-multi-gpu-capability/ ), rather than buying a purpose-built motherboard. Unless I’m mistaken, this isn’t currently possible — do you know if NVIDIA has any plans to implement this in the future?
Thank you very much for this article and all of your helpful comments.
If you are interested in parallelism I recommend looking into Microsoft’s CNTK library. Their parallelisation algorithms, especially 1-bit quantization and block momentum, are so good that you get linear speedups without having NVLink. Granted, the software is a bit difficult to use but is maturing quickly and you could save a lot of money by going without NVLink. I am currently not aware of any affordable NVLink hardware which is used outside of supercomputing. You might get your hands on one of those machines, but it will be expensive. So in the end CNTK might be the only way to go which is practical. This may be disappointing, but I hope it helps!
Is the NEW Ryzen 1800X compatible with the GTX 1080ti ?
Will using theano as the backend work ?
Thank you
Ryzen will work perfectly fine with a 1080 TI. However depending on your work load Ryzen may not be the best option.
The Pro’s and Con’s of Ryzen 1800X is:
Pro: Ryzen has ECC RAM support, which is great for mission-critical situations where data cannot risk being corrupted at any cost. However, if you are mainly doing deep learning, then ECC RAM is not really necessary at all, as most deep learning algorithms and training can be done at 16-bit or even 8-bit precision (which is something the TITAN X Pascal actually excels at, and why a Quadro or Tesla isn't necessary either in most cases).
Pro: Ryzen has 8 cores, which are beneficial if you plan to work with highly multi-threaded programs for video editing, 3D rendering, etc., although in many of these cases you are better off using GPU acceleration instead of relying on a CPU, since CUDA acceleration on an NVIDIA GPU (especially a TITAN X) will be far faster than any CPU. And if you are mostly doing deep learning, or maybe some PC gaming on the side, and aren't running programs that need all those extra cores (deep learning only needs 4 cores even for four-way SLI in most cases, as shown in this article), then the extra cores of Ryzen are frankly redundant.
Con: Ryzen is limited to dual-channel RAM. This pretty much cuts your memory bandwidth in half, which can affect intensive deep learning work somewhat. It also only supports 2666-2900MHz RAM speeds in many cases, which isn't a big deal for deep learning but will affect memory-intensive workstation/professional tasks. It also has a RAM capacity limit of 64GB, compared to the Intel X99 chipset used with CPUs like the i7 6800K, which allows for 128GB of quad-channel RAM clocked at up to 3600MHz. It's up to your situation whether you consider that a problem or not.
Con: Ryzen has no overclocking headroom to speak of. Nobody has really been able to get any Ryzen chip over 4.1GHz, with many even being stuck at 3.9GHz or 4.0GHz (which in the case of the 1800X is literally no overclock at all, since the 1800X boosts to 4GHz out of the box). So if you are using programs that need clock speed, a faster chip would be beneficial.
So overall, unless you really have a specific need for an 8-core chip, I would say that for a deep learning PC, even if you also do normal web browsing, heavy PC gaming, video streaming/encoding, etc., you might be better off with something like an i7 6800K (which has 6 cores / 12 threads but can hit 4.4GHz in some cases, so overall it is a bit better) which is $100 cheaper than the R7 1800X; or perhaps the i7 7700K, which is only $329 ($170 cheaper than the R7 1800X) and can easily overclock to 5GHz with proper cooling (many people have hit 5.2GHz even on high-quality Noctua air coolers or AIO water coolers). The only reason I would specifically get Ryzen is if you really need an 8-core chip for specific programs, as deep learning and most general use doesn't require more than 4 cores.
Keep in mind that 6800K has only 28 PCIe lanes (Ryzen and 7700k are even worse), so if you’re planning to use multiple GPUs (now or in the future), go with E5-1650 v4 (or E5-1620 v4 if you’re on a budget). Also, Skylake Xeons are about to be released (this month), so if you can, wait for them (mainly for AVX512 support).
Hi Michael,
I would like to use 4 GTX 1080 Ti
Which is the best and cheap processor with motherboard ?
Ah yes. Good catch.
Have you really noticed a difference between running a GPU at PCIe 3.0 x8 vs x16 for deep learning, though? In most other situations I've seen, x8 PCIe 3.0 isn't hindering performance much, if at all; you sometimes see a 0.5% or maybe 1% performance delta between the two, but that's typically it.
I haven’t seen this tested anywhere, but I’m guessing it’s important for large networks running on fast GPUs, when it takes longer to move gradients from GPU to GPU than to calculate them.
I am always happy to answer comments, but it give me even more joy to see that people answer each other’s questions. Thanks Michael and Andrew!
Yes, the AMD Ryzen CPU series will be compatible with your NVIDIA cards. In general, all modern CPUs support NVIDIA cards. This is because the CPU and the NVIDIA GPU communicate over a standard protocol (PCIe) that is also used for network cards, storage controllers and so forth, and no CPU manufacturer can afford not to support it. Thus all CPUs should support NVIDIA GPUs, at least those which come as PCIe cards, which is all GPUs except the ones with NVLink, that is, currently the NVIDIA P100.
Hi,
Complete noobie build here.. So all aspects of all things computer needed.. I have been using a laptop till now and I would like to build a reasonably priced PC that can run CNNs. If needed I can tunnel in from anywhere to work with etc.
This is what I have.. would really appreciate any comments – have I missed anything?
Intel Core i5-6500 3.2GHz Quad-Core Processor
Corsair H60 54.0 CFM Liquid CPU Cooler
MSI B150 PC Mate ATX LGA1151 Motherboard
Kingston HyperX Fury Black 8GB (1 x 8GB) DDR4-2133 Memory
Western Digital Caviar Blue 1TB 3.5″ 7200RPM Internal Hard Drive
Asus GeForce GTX 1060 6GB 6GB Turbo Video Card
Phanteks ECLIPSE P400S TEMPERED GLASS ATX Mid Tower Case
Corsair CXM 550W 80+ Bronze Certified Semi-Modular ATX Power Supply
TP-Link TG-3468 PCI-Express x1 10/100/1000 Mbps Network Adapter
Gigabyte GC-WB867D-I PCI-Express x1 802.11a/b/g/n/ac Wi-Fi Adapter
Thank you!
Looks a solid build which offers some opportunities for upgrades in the future.
If I would do more data science I would probably go with cheap or used DDR3 CPU/RAM combo and buy more RAM (32-64GB); possibly I would swap the GTX 1060 for a GTX 1070 if I have the spare money left from switching from DDR4 to DDR3. If I would do more deep learning I would also go for a DDR3 CPU/RAM combo, possibly buy used hardware, and then buy a GTX 1080 Ti.
This does not mean that your build is bad. Your build is more future-proof; my build would be more “I-want-to-do-things-now”. I guess this depends on taste, but be aware of what you want when you buy hardware. Are you buying for data science, deep learning, machine learning, Kaggle competitions, or for being future-proof? Your build buys a little of all of that and a lot of being future-proof, which can be a very sensible choice.
Hi,
You are awesome – thank you for the quick reply (because I have to get the laptop I am working with back asap)
– I want it for deep learning & machine learning primarily, either at the workstation or through a laptop that I can tunnel in with when needing a change of environment.
– I need it to last because I may not have another chance to buy anytime soon.
– In case this matters? I will be using Linux, probably Ubuntu flavour. It was challenging installing on ROG – had to use rpm for some reason.
If you don’t mind:
Where do I get used hardware from?
I can’t find the GTX 1080 ti on pcpartspicker.. its coming out this week looks like? Would you know the best place to get it through?
Oh, and should I look at an SSD for the base installation? If so, will I get away with one that is, say, 125GB?
Final build – for now:
Intel Core i5-6500 3.2GHz Quad-Core Processor 198.68 (w shipping)
Corsair H60 54.0 CFM Liquid CPU Cooler 59.99
MSI B150 PC Mate ATX LGA1151 Motherboard 84.78
Kingston HyperX Fury Black 8GB (1 x 8GB) DDR4-2133 Memory 68.99 (I tend to pull out memory and use it for other builds so kept the newer version)
Western Digital Caviar Blue 1TB 3.5″ 7200RPM Internal Hard Drive 49.99
Gigabyte GeForce GTX 1070 8GB Windforce OC Video Card 369.99
Rosewill GRAM ATX Mid Tower Case 49.99
Rosewill 600W 80+ Bronze Certified Semi-Modular ATX Power Supply (had to up for the new GPU) 59.99
TP-Link TG-3468 PCI-Express x1 10/100/1000 Mbps Network Adapter 9.22
D-Link DWA-552 PCI 802.11g/n Wi-Fi Adapter 9.95
Logitech K120 Wired Standard Keyboard 9.00
I am going to use a 32″ TV – hope that doesn’t kill my eyes..
And have a little mouse and speaker.
Total cost $970.57 (using used where possible)
Ashley, no, this is not how I’d spend a thousand bucks if I needed a cheap machine for DL. Instead of getting all these parts individually, I’d shop for a decent used desktop, then buy a good video card separately. For example, something like this: https://santabarbara.craigslist.org/sys/5992606383.html
Then you will have enough money left for a GTX 1080 and more. The truth is, CPU performance hasn't improved that much in the last 5 years, so for deep learning an old CPU + a 1080 will be faster than a new CPU + a 1070.
Also, you should get a SSD. Again, old CPU + SSD will be faster than new CPU + hard drive.
p.s. and you definitely don’t need a liquid cooler (nor any overclocking).
@Michael – bummer it’s done.
I found this: https://annarbor.craigslist.org/sys/6031393982.html but I am getting a much better rig and half the price..
@Ashley: I’d probably just get this one (after getting the price down to $300, or $350, tops):
https://annarbor.craigslist.org/sys/6031436427.html
The advantage is it’s already got 1050 card in it, so you can start doing DL right away. Later, if you realize you need more power, you can buy 1080 Ti, and will still be within your $1k budget.
Thanks for the advice.
Hi Tim and all Deep Learning guys!
I have an i7-4790K CPU with 32GB of RAM, which should be fine for the beginning.
I'm planning to buy a new GPU and I have a few options. The 1060 is the best seller and the best choice for price/performance, I guess. BTW, here is a list of non-reference-design PCBs: http://thepcenthusiast.com/geforce-gtx-1060-compared-asus-evga-msi-gigabtye-zotac/
1. GTX 1060 3GB reference design 200e
+ really cheap
– poor over clocking results
– just 3GB of VRAM
2. GTX 1060 3GB non reference design (the PCB is based on GTX 1080 with better power feed and 8pin connector) 250e.
+ performance boost +5% 🙂
+ ability to over clock with volt mode
– price
– just 3GB of VRAM
3. GTX 1060 6GB ref design 260e
+ more CUDA cores
+ more VRAM
4. GTX 1060 6GB non reference design (the PCB is based on GTX 1080 with better power feed and 8pin connector. 280e – 300e
+ more CUDA cores
+ more VRAM
+ really good over clocking ability (+15%)
– quite expensive
– price / performance index is not so good any more
5. GTX 980 4Gb used with 1 year warranty 250e
+ more CUDA
+ 5% more performance than 1060 6GB version
– used
– less VRAM
– more power consumption
So what do you think about the above options? What is more important: more VRAM, the number of CUDA cores, GPU clock speed, or VRAM bandwidth?
Thank you for sharing your experience!
If you have missed it you might want to check out my other blog post about GPU selection: GPU advice. To reiterate the points:
– Bandwidth is the thing that you want to have the most of
– The best GPU in terms of cost/performance is the GTX 1070 (and soon also the GTX 1080 Ti)
– GPU memory size is important, but for many tasks 8GB is fine. If you want to do computer vision research, get a 12GB GPU
To answer other questions: CUDA core number and clock speed are not that important. Overclocking will give you almost no performance increase for deep learning.
Hope that helps!
Thank you very much for your answer. Your answer really helps me.
As far as I understood, the 3GB model is rather useless. So the 1060 6GB is fine for getting started, the 1070 8GB is the minimum for any real project, and the Titan X 12GB is required for something serious.
Cheers!
The Titan X and the GTX 1080 Ti have only 1GB difference in memory, but there is a big difference in price.
Does anyone know why?
http://www.eurogamer.net/articles/digitalfoundry-2017-gtx-1080-ti-finally-revealed
Yes, it looks like the Titan X and the new GTX 1080 Ti have basically the same specs, but the 1080 Ti is almost half the price:
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_10_series
I'd nearly ordered a Titan X, only to find them now out of stock at most retailers.
Is there something fundamentally different about the 1080 Ti vs the Titan X where deep learning is concerned? Otherwise it looks like you could build a DevBox clone for a decent price.
Definitely go for the GTX 1080 Ti. The 1 GB memory difference is not significant for most use-cases.
Should I get the Zotac NVIDIA GT 730, since I don't have much money and can spend a max of 5,000 INR? Any suggestions, sir?
The GT 730 variant with GDDR5 memory is a good choice in that price range; the DDR3 variants are much slower, so pay attention to the memory type. The GDDR5 variant has just 1GB of memory, but if you use 16-bit networks you can run some experiments with it. If you need to train larger networks, then a DDR3 variant with more memory (up to 4GB) can be a good choice too. You will have to wait a bit longer for experiments to finish, but you will be able to run most models if you use 16-bit, and you will still get a speedup over the CPU.
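The memory saving from 16-bit is easy to see in isolation; a minimal NumPy sketch (the matrix size is arbitrary):

```python
import numpy as np

weights32 = np.random.randn(1000, 1000).astype(np.float32)
weights16 = weights32.astype(np.float16)  # half-precision copy

print(weights32.nbytes / 1024**2, "MB")   # ~3.8 MB
print(weights16.nbytes / 1024**2, "MB")   # ~1.9 MB, half the footprint
```

Most frameworks let you store weights and activations on the GPU in the same way, which is what makes a 1GB card usable for small experiments.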
Hi,
someone else above had a similar but not exact the same question, hence I would like to ask for your opinion as well 🙂
I understand that it would be optimal to have a CPU with enough native PCIe lanes to connect every GPU with 16 lanes. Given that I would like to build a system with no more than two GPUs, I would need 32 lanes for the GPUs to avoid PCIe bottlenecks. Currently that points to socket 2011-3 (Broadwell-E) CPUs with 40 lanes.
If I instead used, for reasons of cost, a socket 1151 (Kaby Lake) setup with a 16-lane CPU but with a motherboard offering a PLX switch that provides two PCIe x16 slots, one question arises: do the GPUs need the whole PCIe bandwidth permanently, forcing the PLX switch to permanently split the 16x bandwidth into 8x/8x, or is it more likely that the GPUs transmit in an interleaved manner with the full 16x bandwidth available to whichever one is currently transmitting? My guess is that the truth is somewhere in between, but I have no exact numbers or benchmarks. Do you have any experience with the actual bandwidth loss, or suggestions here? Is it beneficial to use a PLX switch with a 16-lane CPU in a dual-GPU configuration, or should I definitely go for a 40-lane CPU?
Cheers,
Chris
Hi Chris,
there are some motherboards which support x16 speed when no transfer to the other GPU is executing, but this is rare. In general you will get x8/x8 speed. Check your motherboard specs for this.
I would not worry too much about PCIe lanes. If you want to parallelize across GPUs there will be a performance hit, but you would still get good speedups. If you use good parallelization algorithms, like those provided by Microsoft's CNTK, then you will have no performance hit. If you use the GPUs separately you will see almost no performance hit. So I would just go ahead with that setup. It will probably give you the best bang for the buck.
Please help: do you recommend getting an Alienware laptop with a GTX 1060 for portability, plus an Alienware Graphics Amplifier with a GTX 1080 Ti for use as a station?
Hey Tim
Some of the links here direct to an old WordPress blog. Is that content unavailable now?
All of my content has been moved to this blog, so you should find it here. I was not aware that there were some old dead links in this blog post. Thank you for making me aware of that; I will clean that up in the next few days.
It is a truly nice and helpful piece of information. I'm glad that you shared this useful information with us.
Please keep us informed like this. Thanks for sharing.
Thank you, I am happy that you found the blog post useful 🙂
Hi, I am into deep learning and currently have a Quadro K5100 GPU with 8GB of memory in a laptop, with compute capability 3.0. I want to build a solid DL rig which can serve me well for at least 4-5 years under a heavy workload. After reading Tim's fantastic blog, I have selected the following using PCPartPicker. As my budget limit is around $2400-2500 max, I have basically gone for a budget CPU but better GPUs. I would go for one 1080 Ti and a minimum of one or a maximum of two 1070s.
Can anyone please look at my build and suggest any improvements? Also, one thing I am confused about is whether I should go for
Asus X99-DELUXE II ATX LGA2011-3 Motherboard 394$
or
Asus X99-A/USB 3.1 ATX LGA2011-3 Motherboard $228.88
Is going for the Deluxe motherboard at almost $180 more justified?
Another thing I am confused about is whether the Founders Edition of the GPUs by EVGA or any other vendor is good enough, or whether I should go with a customized version with more fans, which of course costs more.
My build is:
Intel Xeon E5-1620 V4 3.5GHz Quad-Core Processor (40 lanes) $286.99
Cooler Master Hyper 212 EVO 82.9 CFM Sleeve Bearing CPU Cooler $24.88
Asus X99-A/USB 3.1 ATX LGA2011-3 Motherboard $228.88
Crucial Ballistix Sport LT 32GB (2 x 16GB) DDR4-2400 Memory $219.99
Western Digital BLACK SERIES 2TB 3.5″ 7200RPM Internal Hard $122.88
EVGA GeForce GTX 1070 8GB SC GAMING ACX 3.0 $374.00
EVGA GeForce GTX 1080 Ti 11GB Founder Edition $700.00
Corsair Air 540 ATX Mid Tower Case $119.98
EVGA SuperNOVA G2 1300W 80+ Gold Certified Fully-Modular ATX Power Supply
$182.03
Asus DRW-24B1ST/BLK/B/AS DVD/CD Writer
Total $2278
Thanks
Hi Sami,
I do not see why the $180 would be justified; the board adds one PCIe slot but gives pretty much the same deep learning performance.
Please note that if you have two GPUs with different chips, for example a GTX 1070 and a GTX 1080, you will not be able to parallelize across them.
Often the coolers on the GPUs are quite similar in performance so that it should not be a big deal. However, I am not so familiar with the current fan designs and there might be a fan which is superior to others. I probably would pay $20-30 if the fan performance is > 33% better, but not more. I do not think it is worth it at a certain point – better to save that money to buy another GPU in the future.
Hope this helps
Thanks for the useful info. I didn't know about the GPU parallelization issue with different chips. As these GPUs are not cheap to get, I was thinking that I would use a 1080 Ti for big networks, while using one or two 1070s for small prototypes, hyperparameter searches, or checking different options in parallel on a relatively small scale. Do you agree with my logic, or should I prefer parallelization (by having identical GPUs, in which case I can afford a maximum of two 1080 Tis) over my current view?
My second question is regarding Asus motherboards: they all have very bad reviews on newegg.com from verified owners due to dead boards etc. As I am not based in the US and am getting these purchases through a US-based friend, I cannot make use of the warranty. Do you have any personal experience with a motherboard that you can recommend to run 24/7 for a long time, or can you suggest any particular brand, please?
Once again thanks for you info.
Ah I understand, using a GTX 1080 Ti and a GTX 1070 makes sense if you use them in that way.
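For reference, the usual way to run independent experiments on specific cards is to pin each process to one GPU before the framework initializes CUDA. A small sketch (assuming PyTorch and that the GTX 1070 is enumerated as device 1; check nvidia-smi for the actual ordering on your machine):

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"   # pin this process to the GTX 1070, for example

import torch
device = torch.device("cuda:0")            # inside this process the visible card is device 0
x = torch.randn(1024, 1024, device=device)
print(torch.cuda.get_device_name(0))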
There are in general no reliable motherboard manufacturers, but certain specific motherboard versions are more stable than others. I think orienting yourself by Newegg reviews is a good idea. For example, I would not buy the motherboard that you linked due to its bad reviews. However, I am not current on the market situation for motherboards, so you have to find a good motherboard by yourself. Usually, using pcpartpicker, selecting the X-SLI option where X is how many GPUs you want to have, then sorting by price and picking the first option with good reviews is a sound strategy to find a good motherboard.
Thank you Tim. I have selected the
EVGA X99 Classified 151-HE-E999-KR LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.0 Extended ATX Intel Motherboard. It has 5 x PCI Express 3.0 x16, 4-way SLI, and great reviews on Amazon as well as Newegg, and is around $300. I believe EVGA is a good brand and hope it serves me well. For the GPUs, I will go for the same chip as you suggested, because in the near future I guess all the libraries like Theano, TensorFlow, PyTorch etc. will have to support parallelization.
Thanks
Sami, FYI: https://www.amazon.com/gp/customer-reviews/R19IMBI0BXETDP/ref=cm_cr_arp_d_rvw_ttl?ie=UTF8&ASIN=B00MY3SQ84
@Michael: Oh really. I am tired of searching for reviews and looking things up. Can you please look at my build and suggest any improvements, and especially some motherboard for a max of $300, or give me some direction? Thank you.
Sami, for my last 3 workstations, I didn’t bother building them myself. I sent the desired specs to several system builders, and then negotiated the price down. In the end, I only paid a few hundred bucks more than what it would cost me to do it myself.
For your budget, I would buy a used computer, 2-3 generations old, and get a couple of 1080 Ti cards.
Thank you for the info. Regards
Thank you for the info. EVGA informed me through email that the motherboard has been tested with the Intel Xeon E5-1620 V3 and not the V4. So thanks to you for informing me about it; I have changed the CPU from V4 to V3, as both are almost the same. Here is the reply from EVGA:
“Hello,
Thank you for the email so, unfortunately, EVGA hasn’t tested the newer Xeon CPU like the V4. The only tested is the V3 these are only we have tested that has supported the X99 motherboard with the latest bios update. I apologize for the inconvenience.
Xeon® E5-1680 V3 3.20 GHz 40 1.14
Xeon® E5-1660 V3 3.00 GHz 40 1.14
Xeon® E5-2695 V3 2.30 GHz 40 1.14
Xeon® E5-2697 V3 2.60 GHz 40 1.14
Xeon® E5-2670 V3 2.30 GHz 40 1.14
Xeon® E5-2660 V3 2.60 GHz 40 1.14
Xeon® E5-2687W V3 3.10 GHz 40 1.14
Xeon® E5-2685W V3 2.60 GHz 40 1.14
Xeon® E5-1650 V3 3.50 GHz 40 1.14
Xeon® E5-2667 V3 3.20 GHz 40 1.14
Xeon® E5-2630L V3 1.80 GHz 40 1.14
Xeon® E5-2609 V3 1.90 GHz 40 1.14
Xeon® E5-1620 V3 3.50 GHz 40 1.14
Xeon® E5-2643 V3 3.40 GHz 40 1.14
Xeon® E5-1630 V3 3.70 GHz 40 1.14
Xeon® E5-2603 V3 1.60 GHz 40 1.14
Xeon® E5-2620 V3 2.40 GHz 40 1.14
Xeon® E5-2640 V3 2.60 GHz 40 1.14
Xeon® E5-2623 V3 3.00 GHz 40 1.14
Xeon® E5-2637 V3 3.50 GHz 40 1.14
Regards,
EVGA”
Can anyone please look at my almost final rig and suggest any improvements, or point out any blunder I am about to make, especially any useless spending of money, which I am already very short of. The only aim is to have a solid, reliable rig that can run 24/7 for a long time for around $2500-2600. Thank you very much. Regards.
https://pcpartpicker.com/list/BqRgHN
Maybe you should consider one or two additional case fans? IIRC the power supply fan pushes the air out. If you add another out-blowing fan somewhere at the top and an aspirating one at the bottom this might improve heat dissipation in 24/7 operation.
Does the CPU have 40 lanes?
What is the maximum number of GPUs it can support?
A CPU can have between 16 and 40 lanes. Read the specifications of a CPU to see how many lanes that CPU has. Usually you will need at least 8 lanes for a single GPU, but this is dependent on your motherboard. The CPU can provide support for lanes, but they must be there on the motherboard. A CPU can support a maximum of 4 GPUs.
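If you want to check what link width your CPU and motherboard actually negotiated for each card, one way (assuming an NVIDIA driver with nvidia-smi installed; the query fields below are standard nvidia-smi fields) is:

import subprocess

result = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=name,pcie.link.width.current,pcie.link.width.max",
     "--format=csv"],
    capture_output=True, text=True)
print(result.stdout)   # e.g. a card running at x8 will report 8 as its current width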
Would it be possible for a single CPU (assuming enough dual-threaded cores) to handle 8 GPUs via a 96 lanes PCIe switch?
Another great thank-you from my side as I (hope I) have gained a lot of insight into the hardware setup of deep learning workstations!
Presently I am in the position of defining a deep learning workstation for internal research purposes in a professional environment (i.e. a full-grown Windows IT environment :-/ with a dedicated server room, requirements to use only certified hardware, and such). From a pure deep learning research standpoint I would have opted for a system with 3 or 4 GTX 1080 Ti cards, well aware of the parallelization problem, but at least providing ample computing power for independent parallel jobs to finish in a sensible amount of time. But the only certified offer in the desired power range we got was a system with 2 P6000 cards.
As this card is quite new and mentioned here only once, without further discussion of potential deep learning issues, let me please ask for hands-on experience regarding library support, half-precision deep learning performance, and any further quirks I may not have thought about. I am well aware that the P6000 is sub-optimal cost-wise, but anything not certified is a no-go in this environment. 🙁
Your hints are very appreciated!
The P6000 is based on the GP102 chip, which is very similar to the GTX 1080 Ti and Titan X Pascal. The features and performance will be similar to these cards, that is, you will have the usual support of all deep learning libraries and good computational power, but almost no half-precision performance. So with that card you will get a powerful GPU which you can use in your certified environment. If the cost difference between the P6000 and P100 is slim, you might want to opt for the P100, with which you gain a bit of performance and half-precision computation. However, if the difference is larger, then just go with the P6000.
I’d definitely recommend going with Quadro GP100 or Tesla P100:
https://exxactcorp.com/index.php/product/prod_detail/2048
https://exxactcorp.com/index.php/product/prod_detail/1662
The P100 should provide double the performance of the P6000 for deep learning, with effectively the same amount of memory or more when using half precision (an effective 24GB or 32GB, depending on the P100 variant).
Talk to Exxact folks, I had a very positive experience with them.
Tim and Michael,
thanks for your comments! Sadly, the cost difference seems to be anything but "slim", so I expect the final system to contain two P6000s, at least for starters. The Exxact product line will certainly be compared to local offerings.
Since we are now more than 2 years down the line, and Moore’s law has been doing its thing, I would be curious about an update to this great piece with the current HW (e.g. multi-GTX 1080 Ti’s).
The general hardware recommendations did not change very much and I think I would make the same recommendations that are listed here. If you are interested in GPU recommendations you can read my other blog post about GPUs.
Hi Tim,
I’m reading through each and every single article of yours, they are to-the-point and very helpful for beginners in neural networks, like myself.
Regarding PCIe 3.0, I've noticed that most consumer-grade motherboards fall back to x8/x8/x16 for three GPUs and x8/x8/x8/x8 for four GPUs. Hence, in a multi-GPU setup, where a different CPU thread handles each GPU, it's not possible to make use of the GPUs' x16 capability. Notably, PCIe 3.0 x8 has the same theoretical throughput as PCIe 2.0 x16.
While searching for motherboards with more PCIe lanes, I noticed that some new consumer-targeted motherboards come with a PEX 8747 Broadcom PCIe bridge. That's a 48-lane bridge, which is still insufficient for non-synchronized, concurrent data transfers. Broadcom's top-of-the-line bridge supports 96 lanes (no idea how much this solution costs): 64 lanes could be used for 4 GPUs and 32 additional lanes for the CPU, which means GPUs could communicate with each other using the full PCIe 3.0 x16 bandwidth and up to two GPUs could concurrently transfer data from/to the CPU at full bandwidth.
Have you considered these solutions? Are you aware of motherboards that deliver sufficiently good value for money, e.g. achieving performance that would cost less than alternative solutions, to justify the price?
Thanks in advance.
Hi Nikos,
thanks for your comment. I also stumbled upon these switches, but in the end they are probably not so suitable for deep learning. The details are a bit difficult to understand, but let me try to explain: the problem with these solutions is that they still use the underlying PCIe interface and thus are limited just like normal PCIe transfers. In most graphics applications you do not have parallel GPU-to-GPU transfers, but GPU-to-GPU transfers which are slightly offset in time and also small in size. Under such circumstances you can have clever protocols and extra lanes which feed into the usually attached lanes (which have a hard limit of 16 per GPU) in a safe and secure manner without blocking the channels. In other words, with these switches you can send multiple packets asynchronously and securely, but each GPU still receives one packet at a time; in a normal switch each packet must be scheduled after all other packets on that path have completed, or otherwise one has insecure transfers (which can corrupt the data).
The crux is that in deep learning, or generally in computing, you do many parallel transfers at the very same time and usually these packets are large. With this new fancy switch you can start the transfer of the packets asynchronously, but they will still block each other for access to the GPU (because it takes quite some time to send the full packet). This means that instead of blocking before the transfer you now have blocking during the transfer. I am not sure about the performance in this case, but I could imagine that the performance is the same or even worse than with normal switches.
The reasoning behind these switches is that they trade the synchronization of the full PCIe path with the synchronization of sub-paths on the PCIe circuit (the sub-path to the GPU) which increases performance for many applications, especially graphics applications, but probably not for deep learning.
Hope this helps!
Hi Tim,
Thanks for your response, as always very informative.
I wrote to Broadcom, to ask for additional information regarding their particular PCIE switches and they came back with very interesting feedback.
My understanding from their response is that Broadcom's aforementioned 80- and 96-lane switches should allow for more efficient GPU-to-GPU communication at full x16 speed (per pair), compared to the x8 lanes currently available in triple and quadruple GPU configurations via a 40-lane CPU.
However, they also implied that these [current generation] switches communicate with the CPU via a 16-lane connection, i.e. the CPU cannot establish two x16 connections with corresponding GPUs in parallel via the switch. Multicasting data wouldn't be affected.
I'm looking into reconfirming this with Broadcom, as there might be a benefit in using a motherboard with these switches in multi-GPU configs.
Kind regards,
Nikos
Reading through your response again, I realise that concurrent access to four GPUs from the CPU, for transmission of large amounts of data, via a PCIe switch with a 16-lane upstream link, would result in half the bandwidth compared to the CPU driving 8 lanes to each GPU directly.
Indeed, these switches can be very complicated and I am not sure about every detail. If you gain a bit more insight it would be great if you could share it here. Thanks!
Hi Tim,
Following up from my previous message, I have now confirmed with Broadcom that, due to a constraint imposed by the PCIe specification, its PCIe switches feature a total of 16 lanes on the upstream port, i.e. to the CPU.
The downstream ports are non blocking, i.e. when using a PCIe switch of 80 lanes (64 lanes for the GPUs + 16 lanes for the upstream connection to the CPU), pairs of GPUs can talk to each other directly, using 16 lanes per pair.
If we use 4 GPUs, a single GPU can broadcast data to the others at full x16 speed (versus x8 speed if they were attached directly to the CPU’s PCIe lanes).
Also, the CPU can broadcast data to all four GPUs using the full x16 throughput.
Two pairs of GPUs can exchange data at full x16 speed (again, versus x8 speed if they were attached directly to the CPU’s PCIe lanes).
The downside is lower concurrent (non-broadcast) data throughput from the CPU to the GPUs, where 16 lanes will be shared, delivering the equivalent of x4 throughput per GPU (versus x8 speed, if they were attached directly to the CPU's PCIe lanes, as above).
Thus, it really depends on how much unicast data needs to be transferred -concurrently- between the CPU and the GPUs versus between the GPUs.
Depending on what proportion of the time it takes to train a deep neural network is spent on (a) unicasting data concurrently from the CPU to the GPUs, (b) broadcasting data from the CPU to the GPUs, (c) unicasting data between pairs of GPUs, and (d) broadcasting data from one GPU to the rest, a multi-GPU setup might benefit from a PCIe switch (also called a PLX).
My understanding is that a motherboard with such a switch costs £200-£300 more than normal motherboards.
What do you think?
That is a pretty good insight, thanks Nikos!
So one will get improved performance from such a system. However, for bigger systems it is common to pool the data on the CPU to perform more complicated, tree-like broadcasts through the network. I do not think such broadcasts are currently implemented for GPU memory (this feature might have been added since the last time I checked, which was more than a year ago). I think such a motherboard with the 64 GPU lanes would be optimal in a 4 GPU setup. For a multi-node setup it might still be helpful, but probably too expensive to justify the costs, and for big systems reliability is often more important than squeezing out the last bits of performance.
All of this also depends on the type of algorithm that one uses though, but it is good to know that these motherboards can improve performance! Thanks again!
Hi Tim,
Thanks for this great article, I also read your other one on GPU performance. I’m on a budget right now so planning on buying a GTX 1060 6G, with the intent on upgrading in the future.
In this post, you mention your computer should have at least as much RAM as your GPU. Does that mean it would make more sense for me to buy a computer with 6GB of RAM to match my card? I was originally planning to get a 4GB RAM computer. And in the future, if I get a 1080 Ti with 11GB, will I have to upgrade my computer's RAM to match?
Thanks
Charles
This requirement is not so strict; I should update my blog post on this. If you have 4GB of RAM you will be able to work with most datasets if you stream your data, that is, load it in small batches bit-by-bit. If you do this, 4GB will even suffice for the GTX 1080 Ti. You might run into some problems if you run very large RNNs, but this can be prevented with some code which initializes weights directly on the GPU rather than the CPU. You might also run into problems when you preprocess data, but this too can be managed with some extra code. You should be fine with 4GB.
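To make "streaming" a bit more concrete, here is a rough sketch (assuming NumPy and a hypothetical data.npy file on disk): only one mini-batch at a time ever sits in CPU RAM before being handed to the GPU.

import numpy as np

# Memory-mapped array: the file stays on disk, nothing is loaded up front
data = np.load("data.npy", mmap_mode="r")               # hypothetical dataset file

batch_size = 128
for start in range(0, data.shape[0], batch_size):
    batch = np.asarray(data[start:start + batch_size])  # only this slice enters RAM
    # ... copy `batch` to the GPU and run a training step here ...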
Hey Tim, Nice article. I have a question:
Let's suppose I train a model for a particular task (classify an image into class 1 or 2) on a particular kind of input (800×400 resolution). I want to choose a GPU with the MINIMUM number of cores and power that would give me the result in under 100 milliseconds. How can I estimate this without running the model on the GPU? Is there a relation between the number of GPU cores and the performance of a deep learning model?
Thanks a lot in advance.
The speed of the computational units on different GPUs of the same series is about the same (NVIDIA Titan Xp modules are not much faster than, say, GTX 1060 modules); the reason why bigger cards are faster is that they just have more modules (called streaming multiprocessors, or SMs). If your model is not computationally intensive, then benchmarking some small GPUs and extrapolating by the number of SMs might be a valid option to find the optimal GPU in this case.
For operations which saturate the GPU, such as big matrix multiplications or convolutions in general, this is very difficult to estimate. It sounds like you want to reduce costs. A good way to do this is also through power efficiency, which is a very transparent metric that can be easily optimized. It also sounds like you want to reduce latency; this is very difficult to test because computational graphs differ too widely. The only option that I see is to find people who have these GPUs and let them run benchmarks on your model, or otherwise try to generalize existing benchmarks to your model.
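If you do end up benchmarking and extrapolating, here is a sketch of how to read the SM count and do the crude scaling (assuming PyTorch; the reference numbers are hypothetical, and the linear scaling only holds roughly and only while the GPU stays saturated):

import torch

props = torch.cuda.get_device_properties(0)
print(props.name, props.multi_processor_count)   # number of SMs on device 0

# Crude first-order extrapolation from a benchmark on a smaller reference GPU
measured_ms, reference_sms = 180.0, 10           # hypothetical benchmark numbers
estimate_ms = measured_ms * reference_sms / props.multi_processor_count
print("rough estimate: %.1f ms" % estimate_ms)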
Can you give me some insight into how I can extrapolate the benchmarks? Let us assume I have a GPU with 4 SMs and 4GB of global memory on the Pascal architecture, which gives me X milliseconds average classification time. Theoretically, can I expect a timing of X/2 milliseconds with a GPU having 8 SMs and 8GB of global memory?
In my experience performance does not vary in a linear fashion. How can I estimate timings while extrapolating and interpolating the GPU specs (number of SMs, global memory, memory bandwidth, etc.)?
Thanks
Adarsh, it’s hard to give you any advice, because you didn’t tell us anything about what you’re trying to do exactly: what is your accuracy target (e.g. on ImageNet)? What is your power budget?
People have run VGG and Inception on an Iphone 6s, with 150-300ms latency:
http://machinethink.net/blog/convolutional-neural-networks-on-the-iphone-with-vggnet/
If you have some embedded application in mind, then your best option is the Jetson TX2 dev kit. It should definitely be able to run the latest ImageNet networks with under 100ms latency.
See this nice paper which tested the previous Jetson kit:
https://arxiv.org/abs/1605.07678
It also provides some insight into the relation of amount computation and accuracy.
Hello Tim.
Thank you very much for this blog. You gave me solid background for my understanding of dependency between hardware and deep learning.
I have a question about the bus speed of the CPU. Should that be a concern? As you wrote, the true bottleneck is between the CPU and GPU, and as I understand it the "bus speed" listed on ark.intel spec sheets refers to that connection.
I have to choose between the E5-262x v4/v3 or E5-16xx v4/v3. The 262x family has its bus speed set at 8 GT/s QPI while the 16xx family has 5 GT/s (8 GT/s is what PCIe 3.0 offers). Besides scalability, clock frequency, and memory bandwidth, that is the difference between them, and the only one that matters in deep learning, since all of them have clock speeds above 2GHz and memory bandwidth over 68 GB/s, and I will not make use of scalability.
This link refers to the possible comparison of those models: http://ark.intel.com/compare/92980,92986,92994,92987
Thank you in advance.
It is correct that this is the main bottleneck between the CPU and GPU; however, only a very tiny amount of time is spent on CPU-GPU interactions at the memory level compared to the actual GPU computation. It becomes more relevant if you have multiple GPUs, but for multiple GPUs the main bottlenecks are elsewhere. Currently a good CPU (in terms of bus speed) will improve your deep learning performance by about 0-1.5% compared to a "standard" one, and I would not worry about it too much. I think all the CPUs that you linked are more than fine.
Hi!
First of all a big thanks! You’ve basically created the best resource for deep learning enthusiasts looking to build their own machine.
My computer is going to be situated close to my bed, so one priority is noise. This brings me to the first issue of whether or not to get a reference (Founders Edition) 1080 Ti, as they generally seem to be louder than their OEM counterparts. There seems to be a debate around performance and quality between reference and OEM cards. From what I can gather, most people who are doing long-running computations rather than gaming, especially on multi-GPU setups, favor reference cards due to their fan design, which blows air out the back rather than just circulating air inside the case. I'm starting with one card and plan to add a second one later, and I doubt I'll get more than 2 cards for a while.
For the CPU, I was first looking at Intel i7 6850k, which was the cheapest i7 I could find that supports 40 lanes. However, Intel Xeon E5-1620 V4 is almost half the price and also supports 40 lanes. Not sure if the faster i7 is worth the money here?
Lastly, I was thinking about getting a water cooler for the CPU. I’ve read mixed opinions about water cooling, but I reckon moving air outside of the case should be a good thing as it allows the GPUs to run at lower temperatures?
Here’s the build: https://pcpartpicker.com/list/gzfgxY
Any suggestions highly appreciated!
Hi Tim,
Thanks so much for writing all of this up, it’s very informative. I’m currently picking out parts for a DL machine, and I’m trying to figure out where I may have bottlenecks.
Your piece on DMI for ram to vram transfer is quite interesting. Most of what I’m reading emphasizes high pci-e bandwidth. I’m building a dual gpu system, and I’m wondering if I really need both gpus running at pci 3.0×16, or if x8 is fine for each? It sounds like the DMA bandwidth could be a problem. I couldn’t find much info on DMA related to specific chipsets, however you mentioned 12GB/s. Is this bandwidth the same for different chipsets (I’m comparing z270 to x99). If I’m mostly running independent models on each GPU, would I see much if any benefit to 2x pci-3 x16, or would that only really show a big benefit when running the gpu’s in parallel for a single model? Asynchronous mini-batch allocation is interesting, however I’m not sure if it’s integrated into all of the newer high-level DL frameworks…
Re: the DMA issue, intel’s new optane drives are routed through PCI, and they can be used as ram in addition to long term storage. Do you think that these can be used as a way around the DMA bottleneck??
If you use the right algorithms there will be almost no decrease in performance if you use x8 for each GPU. Even if you use the “wrong” algorithms, performance reduction should be minimal for most models since aggregated transfer-times for 2 GPUs are not that large. The costs increase dramatically as you add more GPUs though — for a 4 GPU system it is important that you are on PCIe 3.0 with at least 32 PCIe lanes from your CPU/motherboard.
I would not worry too much about DMA. I suppose for most chipset/CPU combos it is the same. It might differ a bit here and there, but the performance difference should be negligible. I recommend using PyTorch for parallelism if you have two GPUs, and if you have 4+ GPUs I recommend Microsoft's CNTK.
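In case it helps, here is a minimal sketch of what data parallelism over two GPUs looks like in PyTorch (the model is a made-up example; CNTK's block-momentum algorithm is a different approach and not shown here):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
model = nn.DataParallel(model, device_ids=[0, 1]).cuda()  # split each batch across GPU 0 and 1

x = torch.randn(256, 512).cuda()   # the batch is scattered to both GPUs, outputs gathered on GPU 0
out = model(x)
print(out.shape)                   # torch.Size([256, 10])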
Thanks! After days and days of research I just ordered my hardware.
I thought I’d mention, theres a motherboard that I feel is perfect for dual gpu rigs: The ASRock Z270 supercarrier:
http://www.asrock.com/MB/Intel/Z270%20SuperCarrier/index.asp
This board, like some of the X99 workstation boards, has a PLX switch, allowing dual PCIe x16 or quad PCIe x8 on a Z270 board. For dual-GPU rigs, you get the added benefit of being able to run 2 GPUs 4 slots apart (instead of the usual 3 slots on most non-workstation boards). This helps a lot with cooling since there's more space between the 2 GPUs, especially with non-reference coolers taking up 2.5-3 slots these days.
Hi Sam,
What Processor are you using with ASRock Z270 supercarrier?
Intel i7-6850K Processor ???
Does this motherboard support 40 Lane ?
Thanks
Tom
Hey Tom,
I’m using a i7 7700k. The ASRock Z270 Supercarrier is a Z270 board, so it takes LGA1151 chips, which are limited to 16 lanes, therefore CPU’s for this board only have 16 lanes. Just like how some of the Asus/ASRock X99 workstation boards have dual PLX chips that allow 4x pci x16 with a 40 lane X99 cpu, the Z270 Supercarrier allows a cpu with 16 lanes to run 2 GPUs at pci x16. Of course it’s not that simple – I don’t think the performance is identical when running PLX chips to get “more” pci lanes than your CPU has, but my understanding is that the GPUs can communicate with each other at x16. Same reason why the Nvidia dev boxes use X99 chips with 40 lanes yet operate 4 cards at x16 through a workstation board with plx chips, the Supercarrier lets us run 2 cards at x16 with a 16 lane Z270 cpu. It’s much cheaper than a i7-6850k yet allows for similar GPU bandwidth. I think that Z270 also has some other nice things that X99 doesn’t because it’s much newer, although X99 does support some things that Z270 doesn’t like quad channel memory. Lastly, having a board with 4 slots between GPU’s is nicer for SLI when you’re air cooling.
Dual PCIe x16 should not need any PLX switches. The switch is only needed when you want to do Quad PCIe x16, which is more lanes than a single CPU can support.
Dual PCIe x16 doesn’t need switches when using a 40 lane CPU, however having a switch on a Z270 board allows me to use a much cheaper, still very powerful 16 lane CPU with 2 GPUs at x16.
A non-X99 motherboard with a normal processor which does not have 40 lanes cannot run 2 GPUs at full x16 speed, is that correct?
So in my understanding the ASRock Z270 SuperCarrier is perfect because it allows having 2 GPUs at x16 and 3 M.2 slots.
99.9% of motherboards do not have space to fit 4 GPUs (overclocked ones) without water cooling installed. 4 Founders Edition GPUs can be installed in a motherboard like the "ASUS LGA2011-v3 Dual 10G LAN 4-Way GPU ATX/CEB Motherboard (X99-E-10G WS)", which provides full x16 utilization.
Hi sam,
Can we use a Z270 board and an Intel boxed Core i7-6850K processor together?
Thanks, Sam.
i7-7700k costs over $300. That’s not cheap. For that kind of money you can get a CPU with 40 lanes (e.g. E5-1620v4), and put it into something like ASRock X99 Extreme4 board. Or you could pay more for Asus X99-WS board which has 2 PLX switches and supports quad PCIe x16.
Thanks, Michael.
That’s true, although I’d like to have something newer/faster than ivy bridge as I do use this machine for more than just deep learning. If I really wanted to save money I could use the supercarrier board with a ~$40 kaby lake G3930.
E5-1620v4 is Broadwell, this is the latest generation of Xeon architecture. It has much better memory bandwidth than 7700k.
Ahh my mistake! And you’re correct about the memory bandwidth, although how relevant is that considering the DMA bandwidth bottleneck for CPU memory –> GPU memory transfer?
With a 40 lanes cpu you can have the CPU concurrently exchanging data with 2 GPUs at full bandwidth.
With a x16 lanes CPU you’re limited to either x16 lanes to one GPU at a time, or x8 lanes for concurrent exchange of data with both GPUs.
As Tim mentioned earlier, for 2 GPUs you're fine with x8 per GPU. Having said that, 16 lanes on the CPU may not be sufficient, as some of these lanes may be reserved by the chipset or other PCIe devices, e.g. an integrated M.2 slot.
I would opt for a cpu with more PCIe lanes, as Tim and Michael have advised.
What is the best way to put 4 GPUs (NON-founder edition) easily in a board?
Thanks
So I’ve purchased a ryzen 1700x and a msi x370 sli plus
I am wondering if it is going to bottleneck 2x 1080ti for deep learning?
ryzen only has 24 pcie lanes
the board only supports
x8/x8 in pci 3.0
if i understand correctly that is almost the same as x16 pci 2.0
Is x16 pcie 2.0 single gpu a bottleneck?
is x8/x8 pcie 3.0 going to bottleneck 2 gpus?
a 1080ti has 11gbps ram
but x8 pcie 3.0 would be only around 8gbps while x16 3.0 is around 16gbps
Can a modern gpu stream in new data while it is doing calculations?
in which case some of the 11gbps is being used for compute and some of it is being used to stream new data in?
Bit of a deep learning rookie here.
Grateful for any advice.
Also will the dual channel ram in ryzen be a problem?
Should i build a threadripper machine?
As of now I don’t plan to use 4 gpus.
So as long as my current build doesn’t have a major problem i will use it with 2 gpus
and then next year do a 4 gpu build with canonlake/10nm
It depends on the algorithm, but in general PCIe lanes are not that important with 2 GPUs. It will decrease performance, but not by a lot; maybe 0-10% depending on the use-case. You should be fine.
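To put rough numbers on the x8 vs. x16 question (back-of-the-envelope only, ignoring protocol overhead beyond line encoding):

# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding, roughly 0.985 GB/s per lane per direction
# PCIe 2.0: 5 GT/s per lane with 8b/10b encoding, roughly 0.5 GB/s per lane per direction
pcie3_per_lane, pcie2_per_lane = 0.985, 0.5
print("PCIe 3.0 x8 :", round(8 * pcie3_per_lane, 1), "GB/s")   # ~7.9 GB/s
print("PCIe 3.0 x16:", round(16 * pcie3_per_lane, 1), "GB/s")  # ~15.8 GB/s
print("PCIe 2.0 x16:", round(16 * pcie2_per_lane, 1), "GB/s")  # ~8.0 GB/s, about the same as 3.0 x8

Note that the 11 Gbps figure on a 1080 Ti refers to the per-pin speed of the on-board GDDR5X (roughly 484 GB/s of local memory bandwidth), which is separate from the PCIe link that feeds the card.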
First of all, thank you very much for the comprehensive and deep knowledge you shared with us through the two blog posts: the full build guide and the GPU-focused one.
I hope that you can answer my question, with advance apologies if I am asking the obvious.
I was about to spend around £3800 on a PC (the new ALIENWARE AURORA) which has two GeForce GTX 1080 Tis, 64GB DDR4 at 2400MHz, and an Intel Core i7-7700K processor. I was very happy that I could finally decide which PC I should buy for my PhD research over the next two years. What made me even happier was that I was following your appreciated GPU-focused blog post – yes, I had multiple high-performance GPUs!
However, something took me to the other blog post – this one – and I read the CPU advice, ending with the fact that my CPU has only 16 PCIe lanes, not 40 as you advise. I went back to the first step of my search for a PC, which started 4 months ago.
I did my best again, [focusing only on pre-built PCs by Dell or Lenovo], and I ended up with another ALIENWARE PC – the ALIENWARE AREA-51, which has the same* GPUs and memory as the first PC in my comment, but a different CPU, the i7-6850K with 40 PCIe lanes and PCIe 3.0. However, the cost went up by more than £700, to £4500. It is expensive; I could and will afford it for my PhD, but it is expensive.
When I reached that cost, I remembered two laptops which I had previously ruled out because of their price. I said to myself, if I have reached £4500 with the PC, why not go for a laptop for the long run for one or two thousand more. The laptops are:
(A) ThinkPad P71.
– CPU: Intel Xeon E3-1535M v6 (certainly, back to 16 PCIe lanes).
– Memory: 64GB(16×4) DDR4 2400MHz ECC SoDIMM
– GPU: NVIDIA Quadro P5000 16GB (no two GPUs).
– SSD: 1TB
-HDD: 1TB
-Cost: £6200
(B) Dell Precision 7720:
– CPU: Intel Xeon E3-1535M v6 (certainly, back to 16 PCIe lanes).
– Memory: 64GB(16×4) DDR4 2400MHz ECC SoDIMM
– GPU: NVIDIA Quadro P5000 16GB (no two GPUs).
– SSD: 1TB
-HDD: 2TB
-Cost: £5800
So, my choices are as following:
Choice One: new ALIENWARE AURORA as PC + XPS 15 £1800 as Laptop = £5400
Choice Two: ALIENWARE AREA-51 + my very old crying coughing laptop = £4500
Choice Three: no PC + ThinkPad P71 = £6200
Choice Four: no PC + Dell Precision 7720 = £5800
If you could please help me select one choice, or rank them with your reasons, you will make my next two years technically safe; and I am really sorry to take up your appreciated time with this long comment. I have to say that my research is on two different data spaces: genetic data and textual data.
Finally, thank you again for your contribution through this blog, and thank you in advance for getting this point reading my comment.
*To be fair regarding the cost of the second PC, it has 2 more TB HDD [4TB] than the first PC, however it provides the same size of SSD: 512GB.
These are all solid options, albeit all quite expensive. Note that PCIe lanes are not that important if you have 2 GPUs, but become more important if you have 4 GPUs. However, I do think the biggest issue here is just that these computers are too expensive. If I were you I would go for a used computer and upgrade it to your needs.
For example, just last week I sold my used computer, which is similar to or even better than these options, for 800 pounds on Gumtree. So a smart choice might be to buy a used computer and upgrade it with some parts. For genetics research I would try to find a cheap computer which supports 8 RAM slots, then buy 64GB of RAM for the machine and upgrade to 128GB of RAM if your research requires it. The speed of the RAM is overvalued; a plain DDR3 RAM setup is sufficient and cheap. For some deep learning algorithms or algorithms in computational biology a single GPU should be sufficient, but choose one that has a lot of RAM; 12GB is ideal and I would go for a used GTX Titan X for 400-500 pounds on eBay (make sure your computer has a PSU which supports at least 600 watts).
This option would yield a very high performance computer for roughly 2000 pounds. Of course it requires some manual assembly, but it really is not difficult and you really should try to do this.
If you cannot get a used option with parts due to university bureaucracy, I would go with an ordinary laptop plus a hetzner.de GPU machine, which for a 3-year PhD will cost 4400 pounds but offers everything that you need and can be canceled or upgraded month by month. For most genetics research you should be fine without a GPU, which would cost 2150 pounds on hetzner.de. If your algorithms require double precision then you will need to make a careful choice about which GPU to get, but probably the most cost-efficient solution would involve renting a Tesla GPU in the cloud (AWS for example) to work with double precision when you need it.
So the main options that I see are (1) buying used computer and upgrade its parts, (2) buy ordinary laptop and a dedicated machine in the cloud. These options will give you the best performance per quid.
Don’t know how to thank you. Your generosity representing in your reading time and response is profoundly appreciated.
According to must limitations in buying ‘new’ and ‘high performance’ PC or laptop, I will take your appreciated advice for the future. Now, I believe that I will go with the first choice in my list hoping that two GPUs will be enough most of the time.
Again, thank you very much.
Hi Tim,
many many thanks for your great blog articles, they are a great help!
I have a perhaps slightly off-topic question. Can you recommend any resources for learning about computer hardware on a conceptual level? I am not really interested in the underlying electrical engineering just yet, but in the different components and how they interact. For example, I'm interested in how data is transferred from memory to GPU memory in more detail.
Thank you again,
Felix
That is a good question, but unfortunately, I do not have a good answer for it! I also wanted to learn more about the conceptual side of hardware, but the resources that I found are often university resources and textbooks which also go into the details. What I found most promising was to just do Google searches for specific questions and try to get informed through multiple websites. For example, googling "cpu to gpu memory transfer" will yield blog posts, forum questions, presentations on the topic, and so forth. With that, you can get informed about that question. From there you might have new questions which you can then google. If you do this for a few hours every week, you will become quite knowledgeable about the concepts quite quickly. Hope this helps!
Tim
Thanks, I’ll do that then.
I guess I was initially hoping that there was a nice resource to simplify the learning process 😉
Hi Tim,
Thanks for the thorough write-up, it is truly helpful.
I am in the process of picking parts for a deep learning machine myself, and I have a focus on graph computation and network analysis.
Do you have any insights on whether an i5 7500 (3.5 GHz) with a 1060 6GB and 32GB RAM would do (in a mini-ITX setup), or should I go for a Xeon E3-1225 (3 GHz) with the same GPU but possibly more RAM (64GB)?
Thanks for your efforts to share knowledge so far!
Mirela
Just a small update on my part:
I assume working with node2vec would fit my context best.
(https://snap.stanford.edu/node2vec/)
Also, I’ve found
https://www.google.nl/url…gL8EYt_ZHetggs1UnH7HU14uA
(link to pdf, via CWI)
(perhaps interesting for you as well of course)
And upon long pondering, I assume the Xeon E5 1620 v4 is a wiser choice compared to an i5/i7 setup.
Xeon is mentioned here, as is graph processing for a similar setup:
https://www.youtube.com/watch?v=875NbdL39A0&feature=youtu.be&t=243
+
https://www.youtube.com/watch?v=875NbdL39A0&feature=youtu.be&t=445
I’ve already invested in a ‘good’ (what budget could hold 🙂 GPU, namely the gtx 1060 6gb, and ram would be 16 or 32 gb as well (already useful for R).
But now it seems the Xeon would be the best option.
Regardless, many thanks!
That sounds reasonable. If I were you I would also pay good attention to the motherboard. If it has extra RAM slots (8 slots) then you can always increase the RAM size if you need more; in that way you can upgrade your setup depending on the problem that you are working on.
You might want to go with the 64GB setup depending on what kind of graphs you will work with. Graph structure can differ greatly: some graphs will require you to have more than 100GB of RAM while others are more manageable. The CPU is often less important (but this still depends on the graph and problem, so check this for the problems/graphs you work with). A GTX 1060 might be a bit slow at times, but often you do not work with the full graphs anyway because training would take too long. Thus you could also trim down your graph further, and then a GTX 1060 is a solid choice (no large memory required and a good speedup over the CPU).
Hi Tim,
Many thanks!
I have bought the components for below listed setup, aiming at having as much RAM as possible (‘affordable’ :).
– intel xeon e5 1620 v4
– supermicro x10 srl-f
– Kingston DDR4 32GB RAM (LRDIMM), 2133 MHz
-> I plan to have in total 8 modules of these, in total 256 gb
(it’s even possible to go up 375 based on the cpu, and 1 tb based on the motherboard)
– 256 gb SSD
– 1 tb HDD
– msi geforce gtx 1060 6gb
– noctua cooling
– lepa power, 800W
Have seen https://event.cwi.nl/grades/2016/00-Leskovec-slides.pdf also, seems like RAM is very relevant indeed 🙂
Looking forward to some graph processing!
And thanks again!
Hi Tim,
Thanks for your great blog.
Can I use a "GeForce GTX 1080 Max-Q" laptop for deep learning tasks?
Here is the full description. I need something really portable, but at the same time I need to be able to train RNN models.
HIDevolution Asus ROG Zephyrus GX501VI-XS74-HID1 Black 15.6″ w/ IC Diamond Thermal Compound on CPU+GPU – Optimal System Temperatures (FHD/i7-7700HQ/GTX1080 Max-Q/512G PCIe SSD/16GB RAM)
https://www.amazon.com/HIDevolution-Zephyrus-GX501VI-XS74-HID3-Diamond-Compound/dp/B0736C1PP5/ref=sr_1_7?ie=UTF8&qid=1498792741&sr=8-7&keywords=gx501vi&th=1
The GPU in that laptop is quite powerful so you will be able to train RNNs without any major problems. It also should be quite fast compared to, say, a GTX 1060 which will be quite a bit slower.
The CUDA core count is fine, but apart from that everything else is about 30% lower compared to the full GTX 1080.
Will there be any performance issues when training large models?
You can expect the card to be about 30% slower, but that is still pretty fast compared to other cards. You might need to adapt your models slightly or use 16-bit precision for very large models, but you should be able to run everything that is out there.
Hi Michael /Tim
I am looking for one deep learning PC and I found this “Intel Core i7-7800X Processor”
with
Socket LGA 2066
Compatible with Intel® X299 Chipset
6 Cores/12 Threads
Max Number of PCI Express Lanes 28
Intel® Optane™ memory ready and support for Intel® Optane™ SSDs
AND
MSI Performance Gaming Intel X299 LGA 2066 DDR4 USB 3.1 SLI ATX Motherboard (X299 GAMING PRO CARBON AC)
Is this a good choice for two 1080 Tis at full speed?
Thanks
Johydeep
It looks reasonable. With 28 lanes you will have a bit slower parallelism, but for 2 GPUs this bottleneck is not too large so you should still be fine; I guess you could expect a performance decrease of 10-15% for parallelism with 2 GPUs, which is okay. Otherwise, the specs are quite good for general computation, so if you want to use your CPU for other data science tasks this is a good choice. If you want to only do deep learning I might go for a slower CPU which has more lanes, but your current option is also not too bad.
Hi Tim,
thanks for this great guide!
It helped us choose a deep learning server for TensorFlow. We now use this rack machine https://www.cadnetwork.de/de/produkte/deep-learning but with four Tesla P100s instead of GTX 1080 Tis. I don't know if there is a huge difference between GTX and Tesla.
I can confirm that GPUs 1-3 are used fully, while the fourth GPU delivers only about 40% of its performance. It could be a limitation of the PCIe bus.
Thanks
Thorsten
Hi Thorsten,
that is interesting. I do not think that the 40% performance comes from PCIe issues alone, there might be another thing amiss. It cannot be some cooling issue since then you would see a performance degradation with other GPUs too. It would be interesting to know the reason for this. Let me know if you know more!
I am happy that my guide helped you to choose your server! Indeed, Tesla GPUs are only minimally better than GTX GPUs. The P100 is quite a bit better than the GTX 1080 Ti, but it also costs disproportionately more. I think the GTX 1080 Ti would have been more cost-effective, but often these are not available for servers (NVIDIA has a policy of selling GTX cards only to consumers and Tesla cards to companies), so overall not a bad choice!
Hi Tim, I have a PhD in Computer Science but I have not worked on DL before. For CPU, do you recommend the AMD Threadripper, Xeon or Core i7-7700/7700K? I plan to buy a 1080 Ti first and if needed, add more later.
Any of the CPUs that you listed is fine for deep learning with multiple GTX 1080 Tis. Choose the CPU according to your additional needs (preprocessing, other data science applications, other uses for your computer etc).
Thanks Tim. As far as I know, the software for my other needs does not take advantage of multiple cores, so a faster CPU is better than having more cores. Does deep learning software take advantage of multiple cores and threads? If so, about how many cores and threads would be advantageous? AMD and Intel have different system/memory bandwidth. Which would be better?
Most deep learning libraries make use of a single core or do not use the other cores fully. Thus a CPU with many cores does not have a great advantage.
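That said, extra cores are not wasted: the input pipeline can use them to prepare batches while the GPU computes. A sketch (assuming PyTorch; the random dataset is just a stand-in):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(10000, 32), torch.randint(0, 2, (10000,)))
# num_workers background processes load and prepare batches in parallel with GPU work
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4, pin_memory=True)

for x, y in loader:
    pass    # the training step on the GPU would go here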
Thanks Tim. Regarding the GTX 1080 Ti, several companies sell different variants of the card; which brand and variant do you recommend? I plan to buy one card first and, if needed, add more later.
Somewhere I read a recommendation to stay away from the reference edition, which I read is also called the Founders Edition. Do overclocked 1080 Tis perform significantly better? In the past, overclocked systems tended to fail sooner. Not sure if it is worth it.
I would recommend the cheapest card. The cards are almost the same. Overclocked cards have almost no benefit for deep learning (for gaming they do, though). I am not sure about the Founders Edition; I have not heard anything bad about it other than other cards being cheaper.
Thanks Tim. How does performance scale with the number of GPU cards? For example, if I have X 1080 Tis installed in the same computer, will it take 1/X of the time to complete the same task?
Scaling within one computer is usually quite good. It still depends on the task, but you can expect a scaling factor of 2.5-3.9x for 4 GPUs depending on the software framework. The main drawback is that you have to add extra code which handles the parallelism. I recommend PyTorch for these kinds of tasks.
Hi Tim,
I followed your guide to better understand my needs for the computer I want to build for deep learning applications. However, I have a question regarding the PCIe lanes from the CPU.
Specifically, you mention that 40 PCIe lanes are good to go for 4 GPUs, and also that every GPU communicates over 16 PCIe lanes. In my mind, if I would like to use the full potential of the GPUs, I calculate I would need 16×4 = 64 PCIe lanes on my CPU to make this communication efficient. I definitely misunderstood something here, but I would love to know how you came to this conclusion. So the question basically is: how many PCIe lanes does a CPU need? Do I need more than the ones that my GPUs demand? Is there any other component demanding these buses, making it necessary to have even more?
Thanks in advance for the information.
Best Regards,
Fernando
Generally the more the better, and while PCIe speed is not that important if you only do parallelism among 4 GPUs, it is still the easiest factor with which to improve (or, if you do not have the lanes, degrade) your performance. Generally, only devices that are attached to the PCIe bus draw lanes. For example, if you have a PCIe SSD, this will also affect the transfer speed to your GPUs. The setup in which your PCIe devices can run is specified by the motherboard. For example, you might have a 40-lane CPU, but your motherboard only supports an 8x/8x/8x/8x setup for your PCIe devices, in this case GPUs, so that no GPU can utilize the full x16 speed.
In my research on GPU parallelism, it is usually the case that networking performance is the greatest bottleneck. So if you have few lanes, your algorithms are limited by how much they can scale. There are algorithms which go around this, but currently only Microsoft CNTK supports those algorithms (block momentum parallelism). So in general, having full 40 lanes (or rather 36 because one GPU with 16x speed is useless for deep learning if the other GPUs do not have 16x) is a good thing to have. On the other hand, such CPUs and motherboard that support 40 lanes are more expensive. From a cost efficiency perspective it might be better to go with fewer lanes — you might just get more bang for the buck.
The details are more complicated, but I hope this helps you get an overview of the issue.
Fernando,
Also check if a PCIe switch (PLX) makes sense for the types of workloads you will be creating.
Some motherboards feature PLX chips, to allow for 4 GPUs operating at 16 lanes each. Typically, a PLX chip connects to the CPU using 16 lanes and handles 2 GPUs using 32 lanes. The paired cards can communicate with each other (DMA) via the PLX chip, i.e without consuming any PCIe lanes on the CPU and without being affected by any communication of the other pair of GPUs.
Also, note that each PCIe lane has separate wiring for upstream and downstream traffic, i.e. a device with a 16-lane link can send and receive concurrently at full speed. This is beneficial when the software library uses the following approach:
If the workload can be split into four processing stages that take about the same processing time and each stage can be handled by a separate GPU, here’s how the data would be transferred at full speed: GPU1, GPU2 are attached to PLX1 and GPU3, GPU4 are attached to PLX2. The CPU uses 16 (uplink) lanes to send data to GPU1 via PLX1. At the same time (in parallel), GPU1 transfers the data it has just processed to GPU2 using 16 lanes via PLX1, GPU2 transfers data it has processed to the CPU using 16 (downlink) lanes via PLX1, the CPU transfers this data to GPU3 using 16 (uplink) lanes via PLX2 and, similarly, GPU3 transfers data it has processed to GPU4 using 16 lanes via PLX2; GPU4 transfers the data it has processed back to the CPU using 16 (downlink) lanes via PLX2. You’ll notice that the GPUs make use of all 16 PCIe lanes available to each of them and the CPU also makes full use of 32 lanes (on both directions, up link and downlink).
In other words, your software can potentially make optimal use of 16-lane GPUs, via a CPU with 32 available PCIe lanes, if it only needs to send data from the CPU to the first GPU and receive data concurrently from the last GPU (in a sequence of GPUs, where each GPU does some processing and forwards the data to the next, for further processing) back to the CPU. The workload needs to be balanced, so that GPUs don’t wait too long.
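A toy sketch of the kind of GPU-to-GPU handoff described above (assuming PyTorch and 4 visible GPUs; the stages are made up, and real pipeline parallelism would also overlap the stages in time, which is omitted here):

import torch
import torch.nn as nn

# One made-up stage per GPU
stages = [nn.Linear(1024, 1024).to("cuda:%d" % i) for i in range(4)]

x = torch.randn(64, 1024).to("cuda:0")    # CPU -> GPU1
for i, stage in enumerate(stages):
    x = stage(x)
    if i < 3:
        x = x.to("cuda:%d" % (i + 1))     # GPU i -> GPU i+1 over PCIe (via the PLX on such boards)
x = x.cpu()                               # last GPU -> CPU
print(x.shape)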
I’m aware of two EATX motherboards with this feature and they are quite expensive and I’m not sure if the additional cost can be justified in terms of performance:
ASRock X99 WS-E/10G
ASUS X99-E WS
Regards
When looking for a motherboard, do I need to ensure support for something like NVIDIA Quad SLI, 4-Way SLI, 3-Way SLI, or SLI Technology if I plan to use more than one 1080 Ti in the same machine?
You will not be using SLI for deep learning.
I suggest that you ensure the motherboard has enough PCIe x16 slots for future expansion (as Tim has advised) and, if you are concerned about the number of lanes that will be available in a multi-GPU setup, that you download the manual (PDF) of the motherboard you're interested in buying and check the number of lanes allocated according to the number of GPUs.
Without a PLX chip, depending on CPU lanes, the manual could say for example 2 GPUs at 16/16, 3 GPUs at 16/8/8, 4 GPUs at 8/8/8/8. A motherboard with PLX typically says 2 GPUs at 16/16, 3 GPUs at 16/16/16, 4 GPUs at 16/16/16/16.
Thanks. I heard that although the Threadrippers have a lower CPU clock than the best Intel CPUs, they support 64 PCIe lanes and quad-channel DDR4 memory. Does that mean the TR4 socket motherboards would support running 4 Nvidia GPUs at the same time?
About using more than one GPU for DL: it seems that I need to write software to take advantage of parallelism. Isn't the use of multiple GPUs to solve problems automatic? I mean, when more than one GPU is installed, don't the hardware and software (e.g. TensorFlow) automatically detect the existence of multiple GPUs and divide the task across all the installed GPUs automatically?
Indeed, Ryzen Threadripper comes with 64 lanes, but my understanding is that some of these are reserved for other motherboard features, e.g. the chipset, M.2 slots, etc. Check the following, as an example:
“4 x PCIe 3.0 x16 (single@x16, dual@x16/x16, triple@x16/x16/x8 mode)”
Source : https://www.asus.com/Motherboards/ROG-Strix-X399-E-Gaming/specifications/
It advertises 4 x PCIe 3.0 x16 but then explains that the third GPU will be operating in x8 mode. Check the detailed specs in the manual before you buy a motherboard.
More lanes on the CPU is a good thing, but I suspect a single Threadripper will support up to 3 GPUs at x16 and the server equivalent will support up to 6 GPUs at x16.
Parallelism is implemented differently in each library. Read through Tim's articles; he has provided some very helpful info in his blog. Also, check each library for updates, as they are gradually improved by their respective authors.
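To make the "not automatic" point concrete, a small sketch (assuming PyTorch): the framework detects all installed GPUs, but work only lands on a second card if your code explicitly puts it there.

import torch

print(torch.cuda.device_count())            # e.g. 2 -- both cards are detected

a = torch.randn(1000, 1000, device="cuda:0")
b = torch.randn(1000, 1000, device="cuda:1")
# a @ b would raise an error: tensors must be moved to the same device explicitly
c = a @ b.to("cuda:0")
print(c.device)                             # cuda:0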
I asked Asus. Their reply is:
” I understand that want to the specifications of ROG Zenith Extreme and ROG Rampage VI Extreme. I know that as a computer user you want to customize your motherboard on your own preference to use it to its full potential. Let me continue assisting you with your concern.
For the two motherboard, it can support 4 x PCIe 3.0 x16 (x16, x16/x16, x16/x0/x16/x8, or x16/x8/x8/x8) at the same time since they didn’t share any bandwidth with any of the slots in the motherboard.”
Does that mean that with these two motherboards I can use four 1080 Ti GPUs running at top speed at the same time?
In another reply, support mentioned that even if I added an SSD or other PCIe cards, the four GPUs could still run at top speed.
My understanding is that “4 x PCIe 3.0 x16 (x16, x16/x16, x16/x0/x16/x8, or x16/x8/x8/x8)” means:
1 GPU at x16, i.e. at full speed
2 GPUs at x16/x16, i.e. both cards at full speed
3 GPUs at x16/x16/x8, i.e. two cards at full speed, the third at half speed,
4 GPUs at x16/x8/x8/x8, i.e. one card at full speed, the three others at half speed.
Hence, a motherboard with the aforementioned capability does not support more than 2 GPUs at full speed.
I suggest you go back to the manufacturer and ask them to clarify their position. You can ask them, for example: if you attach 3 or 4 GPU cards and each card supports x16 bidirectional PCIe lanes, will the particular motherboard (of interest) allocate, to each card, x16 dedicated bidirectional PCIe lanes, for concurrent transfer of data between the cards and the CPU?
If the motherboard supports 4 GPUs at full speed, it’s typically reported as x16/x16/x16/x16.
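Once the machine is built, you can also verify from software how many lanes each card actually negotiated, instead of relying on the spec sheet alone. A rough sketch, assuming the pynvml Python bindings to NVIDIA's NVML library are installed (nvidia-smi -q reports the same link information); note that the link can drop to a lower speed at idle, so check it under load:

import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        cur_width = pynvml.nvmlDeviceGetCurrPcieLinkWidth(handle)
        max_width = pynvml.nvmlDeviceGetMaxPcieLinkWidth(handle)
        cur_gen = pynvml.nvmlDeviceGetCurrPcieLinkGeneration(handle)
        print(f"GPU {i} ({name}): PCIe gen {cur_gen}, x{cur_width} current, x{max_width} max")
finally:
    pynvml.nvmlShutdown()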
Thanks Nikolaos. I will ask support as you suggested.
I may be wrong, but I get the impression that he might be trying to hide something. For example, in a previous email, he wrote the following.
Hmm… what does “For the full speed, it actually depends on how you will use the 4 ROG-STRIX-GT1080TI-11GB at the same time” mean?
He also suggested getting an overclocked version of the 1080 Ti. Does an overclocked version perform noticeably better?
—————-
“The ROG Zenith Extreme again can work with 4 graphics card since it support multi-GPU and supports 4 way SLI Technology. For the full speed, it actually depends on how you will use the 4 ROG-STRIX-GT1080TI-11GB at the same time. I can still recommend the Zenith if you don’t want to overclock your GPU. Your GPU speed will not lower down even if you connect an SSD or another expansion card since there are no bandwidth between the PCIE slots. For Intel i9 processor I can suggest the ROG Rampage VI Extreme.”
Tim has an excellent article on GPUs, where he explains the pros and cons of different types of GPU coolers. I suggest that you patiently read Tim's articles and the responses he has given to other readers. There's a wealth of information here:
http://timdettmers.com/2017/04/09/which-gpu-for-deep-learning/
Hi Tim, I'm new to deep learning and computer vision and I need to build a workstation for that within a $1000 budget. I'll be considering used and low-cost components available in Pakistan. So far I have found the following options.
Board + Processor + Casing + Power Supply
Dell T3610 with Xeon E5 series CPU (12 cores, 35M or 30M cache, 2.0 GHz) – $570
Asus X99 motherboard with Core i7-5820K (6 cores, 15M cache, 3.3 GHz) – $237 + $332 + $10 + $42 = $621
MSI X99S SLI Plus with Xeon E5-2620 v4 (8 cores, 20M cache, 2.1 GHz) – $801 + $20 + $42 = $863
HP Z820 tower with Intel Xeon E5-2687W (8 cores, 20M cache, 3.1 GHz) – $550
HP Z620 tower with Intel Xeon E5-2650 (8 cores, 20M cache, 2.0 GHz) – $431
GPUs
GTX 1050Ti – $185 – 2GB
GTX 1060 – $400
GTX 1070 – $512
GTX 1080 – $711
Other options include a Quadro 5000 with 2.5 GB and a 320-bit memory bus
RAM
16GB DDR4 $142
16 GB DDR3 $33
HDD
500 GB $19
1 TB $33
SSD
128 GB $28
Please guide me toward the most powerful and lowest-cost combination that will help me in the future. Also let me know if a better combination of motherboard and processor can be made from the available parts.
You can save money by using the DDR3 RAM option with a suitable motherboard. The cheap E5 options look quite good to me. I would go for a GTX 1070 given these prices. If you are short on money, a GTX 1060 with 6GB of memory would also be okay. Hope that helps!
I cannot find a motherboard that supports Threadripper and 4 x PCIe 3.0 at x16/x16/x16/x16. Why is no such motherboard available?
You never will. Threadripper has 64 PCIe lanes, but 4 of them are reserved for the chipset, and most motherboards now feed 4, 8, or 12 lanes to NVMe/SSD drives and other storage.
I checked Newegg and indeed the X399 board's specs show that it only supports standard PCIe setups. However, if you look at the manufacturer's homepage you will see that it does support the full 64 PCIe lanes. I assume Newegg's system is not yet updated to list an x16/x16/x16/x16 configuration in the specs (the spec fields seem to be standardized). See for example https://www.gigabyte.com/Motherboard/X399-AORUS-Gaming-7-rev-10#kf
Thanks Tim. I am a bit confused. The specifications make no mention of 4 slots at x16, only 2 at x16. The online manual states the same thing. Am I reading it wrong?
1. 2 x PCI Express x16 slots, running at x16 (PCIEX16_1, PCIEX16_2)
2. 2 x PCI Express x16 slots, running at x8 (PCIEX8_1, PCIEX8_2)
(The PCIEX16 and PCIEX8 slots conform to PCI Express 3.0 standard.)
3. 1 x PCI Express x16 slot, running at x4 (PCIEX4)
(The PCIEX4 slot conforms to PCI Express 2.0 standard.)
You are right, I just searched for "lanes" and confused the 64 that I saw with the specs for the motherboard. This is indeed strange; why do they not support the full 64 lanes? There was another blog post saying that this particular board would support that, but the manufacturer's page clearly says it does not. I would get in touch with the manufacturer and just ask.
As far as I know, even though Threadripper supports 64 lanes, none of the available motherboards allows 4 GPUs to run at x16/x16/x16/x16 at the same time. So if I want more than two GPUs running at x16, will I need to choose an Intel CPU?
The following article suggests that only 56 out of 64 PCI-E lanes can be used for GPUs:
http://www.guru3d.com/articles-pages/amd-ryzen-threadripper-1920x-review,4.html
I haven't found a good explanation, but I think it's likely that 4 lanes are used to connect the CPU to the X399 chipset.
PCIe switches could be used in future workstation motherboards to support more than 3 GPUs at x16.
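To see why an x16/x16/x16/x16 layout never quite fits, here is a tiny back-of-the-envelope check; the reserved-lane figures are the ones mentioned in this thread and differ between boards, so treat them as placeholders:

# Illustrative lane budget for a 64-lane CPU (figures from the discussion above).
def lanes_left_for_gpus(total_cpu_lanes, chipset_lanes, storage_lanes):
    """Lanes remaining for GPU slots after chipset and M.2/NVMe reservations."""
    return total_cpu_lanes - chipset_lanes - storage_lanes

remaining = lanes_left_for_gpus(total_cpu_lanes=64, chipset_lanes=4, storage_lanes=4)
print(remaining)             # 56: enough for 3 GPUs at x16 (48 lanes)...
print(remaining >= 4 * 16)   # False: ...but not for 4 GPUs at x16 (64 lanes)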
Both Intel and AMD announced an overwhelming number of CPUs in August. Which CPU would be best for ML/DL? At first I considered the Threadripper, but there is no motherboard for it that supports running 4 GPUs at x16/x16/x16/x16 at the same time. I will probably get two 1080 Tis, but I may need four later. Is there an advantage in getting a dual-CPU motherboard vs. getting 2 computers?
I read that the CPU is not as important as the GPU for DL; you just need to make sure the number of CPU cores is 2x the number of GPUs. However, I also read that CPU cores can be assigned to take part in the ML/DL computation. So, does that mean it is good to have as many cores as I can get?
More cores are always better, but it is also a question of how much you want to pay. I think CPU cores = 2x GPUs might be a bit much at the high end. If you get 3 GPUs, a 4-core CPU is still sufficient. If you have 4 GPUs, a 6-core CPU would also be sufficient. I would, however, not recommend a 2-core CPU for 3 GPUs. 4 cores for a 4-GPU system is borderline: it will be okay if you only run deep learning, but it might become a bottleneck if you run any other application in addition. So choose according to your budget and your needs.
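Those extra cores are mostly spent in the input pipeline: decoding and augmenting the next mini-batches on the CPU while the GPU trains on the current one. A minimal sketch of where that happens, assuming TensorFlow 2.x; CIFAR-10 stands in for your data and the augment function is just a placeholder, with num_parallel_calls tuned to your spare cores:

import tensorflow as tf

def augment(image, label):
    image = tf.image.random_flip_left_right(image)  # CPU-side preprocessing, runs on worker threads
    return image, label

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
           .shuffle(10000)
           .map(augment, num_parallel_calls=4)  # ~2 CPU threads per GPU is a common rule of thumb
           .batch(128)
           .prefetch(2))  # prepare batches ahead so the GPU never waits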
Thanks. Some sites suggested 8-16GB of RAM, but I found recent posts suggesting 32GB or 64GB. It is also not uncommon to see posts from users with 128GB or 256GB of RAM. What is a reasonable amount of RAM for a home computer, above which it would be better to use online computing services?