why is parallel flow slower?

There are several problems going on here in parallel, so to speak.

The first is that solving a problem in parallel always results in more effective work than sequential work. Overhead is involved in dividing work across multiple threads and merging or merging the results. Problems like converting short strings to lowercase are small enough to risk being swamped with parallel division overheads.

The second problem is that Java benchmarking is very subtle, and it’s very easy to get confusing results. Two common problems are JIT compilation and dead code elimination. Short benchmarks often finish before or during JIT compilation, so they’re not measuring peak throughput, and in fact may be measuring JIT itself. When compilation occurs it is somewhat nondeterministic, so the results can vary wildly as well.

For small synthetic benchmarks, the workload often calculates the results that are eliminated. JIT compilers are pretty good at detecting this and eliminating code that doesn’t produce results that are used everywhere. This is probably not happening in this case, but if you tinker with other synthetic workloads, it can certainly happen. Of course, if SIC takes the workload off the benchmark, it renders it useless.

I highly recommend using a well-developed benchmarking framework such as JMH instead of spinning one of your own by hand. JMH has facilities to help avoid common benchmarking pitfalls, including these, and is fairly easy to set up and run. Here is your converted benchmark for using JMH:

package com.stackoverflow.questions;

import Java.util.Arrays;
import Java.util.List;
import Java.util.stream.Collectors;
import Java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class SO23170832 {
    public static class BenchmarkState {
        static String[] array;
        static {
            array = new String[1000000];
            Arrays.fill(array, "AbabagalamagA");

    public List sequential(BenchmarkState state) {
                  .map(x -> x.toLowerCase())

    public List parallel(BenchmarkState state) {
                  .map(x -> x.toLowerCase())

I ran it using the command:

Java -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

(The options indicate five warm-up iterations, five benchmark iterations, and a forked JVM.) While running, JMH emits many detailed messages, which I have evaded. The summary results are as follows.

Benchmark                       Mode   Samples         Mean   Mean error    Units
c.s.q.SO23170832.parallel      thrpt         5        4.600        5.995    ops/s
c.s.q.SO23170832.sequential    thrpt         5        1.500        1.727    ops/s

Note that the results are in operations per second, so it appears that parallel running is about three times faster than sequential running. But my machine only has two cores. Hmmm. And the average error per run is actually greater than the average run time! WAT? Something suspicious is going on.

This brings us to a third problem. Looking more closely at the workload, we can see that it allocates a new String object for each input and also collects the results into a list, which involves a lot of reallocating and copying. I imagine this will involve a fair amount of garbage collection. We can see this by re-running the benchmark with GC messages enabled:

Java -verbose:gc -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

This gives results such as:

[GC (Allocation Failure)  512K->432K(130560K), 0.0024130 secs]
[GC (Allocation Failure)  944K->520K(131072K), 0.0015740 secs]
[GC (Allocation Failure)  1544K->777K(131072K), 0.0032490 secs]
[GC (Allocation Failure)  1801K->1027K(132096K), 0.0023940 secs]
# Run progress: 0.00% complete, ETA 00:00:20
# VM invoker: /Users/src/jdk/jdk8-b132.jdk/Contents/Home/jre/bin/Java
# VM options: -verbose:gc
# Fork: 1 of 1
[GC (Allocation Failure)  512K->424K(130560K), 0.0015460 secs]
[GC (Allocation Failure)  933K->552K(131072K), 0.0014050 secs]
[GC (Allocation Failure)  1576K->850K(131072K), 0.0023050 secs]
[GC (Allocation Failure)  3075K->1561K(132096K), 0.0045140 secs]
[GC (Allocation Failure)  1874K->1059K(132096K), 0.0062330 secs]
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.stackoverflow.questions.SO23170832.parallel
# Warmup Iteration   1: [GC (Allocation Failure)  7014K->5445K(132096K), 0.0184680 secs]
[GC (Allocation Failure)  7493K->6346K(135168K), 0.0068380 secs]
[GC (Allocation Failure)  10442K->8663K(135168K), 0.0155600 secs]
[GC (Allocation Failure)  12759K->11051K(139776K), 0.0148190 secs]
[GC (Allocation Failure)  18219K->15067K(140800K), 0.0241780 secs]
[GC (Allocation Failure)  22167K->19214K(145920K), 0.0208510 secs]
[GC (Allocation Failure)  29454K->25065K(147456K), 0.0333080 secs]
[GC (Allocation Failure)  35305K->30729K(153600K), 0.0376610 secs]
[GC (Allocation Failure)  46089K->39406K(154624K), 0.0406060 secs]
[GC (Allocation Failure)  54766K->48299K(164352K), 0.0550140 secs]
[GC (Allocation Failure)  71851K->62725K(165376K), 0.0612780 secs]
[GC (Allocation Failure)  86277K->74864K(184320K), 0.0649210 secs]
[GC (Allocation Failure)  111216K->94203K(185856K), 0.0875710 secs]
[GC (Allocation Failure)  130555K->114932K(199680K), 0.1030540 secs]
[GC (Allocation Failure)  162548K->141952K(203264K), 0.1315720 secs]
[Full GC (Ergonomics)  141952K->59696K(159232K), 0.5150890 secs]
[GC (Allocation Failure)  105613K->85547K(184832K), 0.0738530 secs]
1.183 ops/s

Note: Lines starting with # they are normal JMH output lines. Everything else is GC messages. This is only the first of five warm-up iterations, preceding five reference iterations. GC messages continued in the same vein during the rest of the iterations. I think it is safe to state that the measured performance is dominated by GC overload and that the reported results should not be believed.

At this point it is not clear what to do. This is purely a synthetic workload. It clearly takes very little CPU time to actually work compared to allocating and copying. It’s hard to tell what you are really trying to measure here. One approach would be to invent a different workload that is in a sense more “real”. Another approach would be to modify the heap and GC parameters to avoid GC while running the benchmark.

Leave a comment