
How do I use all my cores?

03.15.2010
News Flash: Multi-core programming is "hard". EVERYBODY PANIC.

ZOMFG: We either need new tools, new languages or both! Right Now!

Here's one example. You can find others. "Taming the Multicore Beast":

The next piece is application software, and most of the code that has been written in the past has been written using a serial approach. There is no easy way to compile that onto multiple cores, although there are tools to help.

What?

That's hooey. Application software already works in a multi-core environment; it has simply been waiting for multi-core hardware to arrive. And it requires little or no modification.

Any Linux-based OS (and even Windows) will take a simple shell pipeline and ensure that the processing elements are spread among the various cores.

Pipelines and Concurrency


A shell pipeline -- viewed as Programming In The Large -- is not "written using a serial approach". Each stage of a shell pipeline runs concurrently, and folks have been leveraging that since Unix's inception in the late 60's.

When I run python p1.py | python p2.py, both processes run concurrently. Most OSs will farm them out so that each process runs on its own core. That wasn't hard, was it?
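To make that concrete, here is a minimal sketch of what p1.py and p2.py might contain. The file names come from the command above; the actual transformations (emitting integers, squaring them) are invented purely for illustration.

# p1.py -- first stage: generate lines of data on stdout
import sys

for i in range(1000000):
    sys.stdout.write("%d\n" % i)

# p2.py -- second stage: read lines from stdin, transform, write to stdout
import sys

for line in sys.stdin:
    value = int(line)
    sys.stdout.write("%d\n" % (value * value))

The shell wires p1's stdout to p2's stdin, both processes start at once, and the scheduler is free to put each one on its own core.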

I got this email recently:
 
Then today, I saw the book Software Pipelines and SOA: Releasing the Power of Multi-Core Processing, by Cory Isaacson.
At that point, I figured that there are a lot of yahoos out there that are barking up the wrong tree.

I agree in general. I don't agree with all of Isaacson's approach. A big ESB-based SOA architecture may be too much machinery for something that may turn out to be relatively simple.

Easy Problems


Many problems are easily transformed into map-reduce problems. A "head" will push data down a shell pipeline. Each step in the pipeline is a "map" step that does one incremental transformation on the data. A "reduce" step can combine data for further maps.

This can be expressed simply as: head.py | map1.py | map2.py | reduce1.py | map3.py. You'll use all of your cores heavily.
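Each map step is just a filter that reads stdin and writes stdout; the reduce step accumulates before emitting. Here is a sketch of one map and one reduce stage, assuming the data flows through the pipeline as simple "key,value" lines (that layout, and the transformations, are assumptions for illustration only):

# map1.py -- one incremental transformation per line
import sys

for line in sys.stdin:
    key, value = line.rstrip("\n").split(",")
    print("%s,%s" % (key.lower(), value))    # example transformation: normalize the key

# reduce1.py -- combine values by key, then emit totals for further map steps
import sys
from collections import defaultdict

totals = defaultdict(int)
for line in sys.stdin:
    key, value = line.rstrip("\n").split(",")
    totals[key] += int(value)

for key, total in sorted(totals.items()):
    print("%s,%d" % (key, total))

Because every stage is a separate process, head.py can be reading its next chunk of input while map1.py, map2.py, and reduce1.py chew on earlier data on other cores.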

Optimization


Some folks like to really focus on "balancing" the workload so that each core has precisely the same amount of work.

You can do that, but it's not really going to help much. The OS already does most of this through ordinary demand-based scheduling. Further fine-tuning is a nice idea, but hardly worth the effort until all other optimization cards have been played. Even then, you'd simply be moving functionality around, refactoring map1.py | map2.py into a single process, map12.py.
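If profiling ever does show a badly unbalanced pipeline, fusing two stages is just function composition inside one process. A sketch, assuming each stage boils down to a per-line function (the bodies of map1 and map2 here are stand-ins, not the real transformations):

# map12.py -- map1 and map2 fused into a single process
import sys

def map1(line):
    return line.lower()             # stand-in for map1.py's transformation

def map2(line):
    return line.replace(",", "\t")  # stand-in for map2.py's transformation

for line in sys.stdin:
    sys.stdout.write(map2(map1(line.rstrip("\n"))) + "\n")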

Easy and well-understood.

Harder Problems


The Hard Problems involve "fan-out" and "fan-in". Sometimes we think we need a thread pool and a queue of processing agents; sometimes that isn't actually necessary, because a simple map-reduce pipeline is all we need.

But just sometimes, there's a fan-out where we need multiple concurrent map processors to handle some long-running, complex transformation. In this case, we might want an ESB and other machinery to handle the fan-out/fan-in problem. Or we might just need a JMS message queue that has one writer and multiple readers (1WmR).

A pipeline has one writer and one reader (1W1R). The reason fan-out is hard is that Linux doesn't offer a trivial 1WmR abstraction.

Fan-in, by contrast, is easier: we have a many-writer, one-reader (mW1R) abstraction available in the select function.
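For instance, a parent holding the read ends of several pipes can interleave whatever arrives first. A sketch, assuming the pipes already exist as ordinary file objects; the fan_in helper is hypothetical, not a standard-library function:

# fan_in.py -- many writers, one reader (mW1R) via select
import select

def fan_in(readers):
    """Yield lines from several pipe file objects as each becomes ready."""
    open_readers = list(readers)
    while open_readers:
        ready, _, _ = select.select(open_readers, [], [])
        for r in ready:
            line = r.readline()
            if line:
                yield line
            else:                        # the writer closed its end of the pipe
                open_readers.remove(r)
                r.close()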

The simplest way to do fan-out is to have a parent that forks a number of identical children. The parent then simply round-robins the requests among the children. It's not optimal, but it's simple.
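A sketch of that arrangement using the multiprocessing module rather than calling fork() directly; the number of children and the do_work function are assumptions:

# fan_out.py -- one writer, many readers (1WmR) via round-robin over children
import multiprocessing
import sys

def do_work(item):
    pass                                 # stand-in for the long-running transformation

def worker(queue):
    # every child is identical: pull items until the None sentinel arrives
    for item in iter(queue.get, None):
        do_work(item)

if __name__ == "__main__":
    n = 4                                # assumed number of children
    queues = [multiprocessing.Queue() for _ in range(n)]
    children = [multiprocessing.Process(target=worker, args=(q,)) for q in queues]
    for child in children:
        child.start()

    # the parent simply round-robins the requests among the children
    for i, line in enumerate(sys.stdin):
        queues[i % n].put(line)

    for q in queues:
        q.put(None)                      # tell each child there is no more work
    for child in children:
        child.join()

A single shared queue would balance the load better; the round-robin version is shown because it matches the description above.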

Bottom Line


Want to make effective use of your fancy, new multi-core processors?

Use Linux pipelines. Right now. Don't wait for new tools or new languages.

Don't try to decide which threading library is optimal.

Simply refactor your programs around the Map-Reduce design pattern.
Published at DZone with permission of Steven Lott, author and DZone MVB. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)

Comments

Will Dwinnell replied on Mon, 2010/03/15 - 9:08am

I suppose the difficulty of taking advantage of multiple cores depends on what kind of computing one wants to do, and what development tools are at one's disposal. Under ideal circumstances, such as number crunching which can conveniently be broken into nonsequential pieces, using software made for this purpose, this is actually quite easy. See, for instance, my experience using the Parallel Computing Toolbox with MATLAB:

Parallel Programming: A First Look (Data Mining in MATLAB)

 

-Will Dwinnell

 

Edgar Sánchez replied on Mon, 2010/03/15 - 11:36am

I think the fastest way to go would be to implement libraries inside the Java SDK along the lines of the .NET 4 Task Parallel Library ( http://msdn.microsoft.com/en-us/library/dd460717(v=VS.100).aspx ). I must say that the .NET Parallel Extensions are a pleasure to work with :-)

Jilles Van Gurp replied on Mon, 2010/03/15 - 2:35pm

This is a classic hammer-and-nail kind of thing, where pipes and filters are a particularly old hammer and map-reduce is about the same age but recently re-discovered. Neither will solve all your problems.

Yes, concurrency is hard, especially if you have no background in computer science, i.e. if you lack basic understanding of the abstractions that can make your life easier. If you do have such a background, the next step is understanding the different patterns that exist in this space: producer-consumer, semaphores, message queues, callbacks, functions without side effects, threads, blocking/non-blocking IO, etc.

Still with me? Now the good news. Your needs are probably quite modest and well covered by some existing framework. Using the java.util.concurrent API is not exactly easy, but if used properly it will allow you to dodge synchronization issues.

A few useful tricks that you should practice regardless of whether you are going to run with multiple threads:

- Don't use global variables.

- Don't share object state with mutable objects.

- Use Dependency injection (i.e. don't initialize objects yourself). Keep the number of dependencies per class low.

- Separation of concerns. Make methods do only one thing and keep your classes cohesive (i.e. don't dump random methods in one class).

If you do all this properly, your design will make a shared nothing approach a lot easier. Shared nothing is what you need to parallelize. Shared something means context switches and synchronization. These are the two things that make concurrent programming hard. If you can do shared nothing, concurrency is easy.

If you can't, work on how you share between processes/threads. Asynchronous messaging is great here. Use callback mechanisms or some kind of queueing solution. Avoid manipulating semaphores and locks yourself; leave that to some off-the-shelf solution.

Unix pipelines are great if all processes in the pipeline are independent, don't contend for the same resources, need to do about the same amount of work, and can work on partial results as they are streamed from the predecessor. If not, you've got a clogged pipe and a bunch of processes waiting for it to become unclogged.

I recently used a for loop in bash to fork 120 commands for a large input file that I had split into smaller files. Then each command was several unix commands piped together. Unix is good at this (to a certain limit). Four cores maxed out, and the process finished about 20x as fast as the non-parallelized version, which was basically a clogged pipe. Map-reduce frameworks do the same: split the input and work on each chunk in parallel. Great stuff. If you do a lot of batch processing in Java, take a good look at Spring Batch instead.
