Java Collectors & groupingBy — collect(), Downstream Collectors & Custom Collectors

Why Collectors are the hard part of streams

Most of the Stream API is easy to skim: filter, map, reduce read like a sentence. Collectors are where interviewers separate people who used streams from people who understand them. The reason is that a Collector is not one operation but a small recipe of cooperating functions, and the Collectors factory composes those recipes into each other — a groupingBy whose downstream is a mapping whose downstream is a joining. Once you see collectors as building blocks that nest, the whole API stops feeling like a grab-bag of static methods and starts feeling like a small language for aggregation. This guide builds that mental model from the bottom up.

collect() is a mutable reduction

The collect() terminal operation performs a mutable reduction: instead of folding immutable values together like reduce, it pours elements into a mutable result container (an ArrayList, a HashMap, a StringBuilder) and accumulates in place. That in-place mutation is why collect is the efficient way to build a collection: no new container is allocated per step, just repeated add calls on the same one.

// reduce combines immutable values; collect mutates one container
List<String> upper = names.stream()
    .map(String::toUpperCase)
    .collect(Collectors.toList());   // pours into one ArrayList

The argument to collect is a Collector, and java.util.stream.Collectors is a factory of ready-made ones. Rule of thumb: reach into the Collectors factory first; only hand-roll a collector when nothing there composes into what you need.
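To make the contrast with reduce concrete, here is a minimal sketch using the three-argument collect(supplier, accumulator, combiner) overload on Stream. The class and method names are hypothetical, chosen only for illustration.

```java
import java.util.List;

class MutableReductionDemo {
    // reduce-style: every step builds a brand-new String from two immutable ones
    static String concatWithReduce(List<String> words) {
        return words.stream().reduce("", (a, b) -> a + b);
    }

    // collect-style: StringBuilder containers are created by the supplier and mutated in place
    static String concatWithCollect(List<String> words) {
        return words.stream()
                .collect(StringBuilder::new,      // supplier: make an empty container
                         StringBuilder::append,   // accumulator: fold one element in
                         StringBuilder::append)   // combiner: merge two containers (parallel only)
                .toString();
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "c");
        System.out.println(concatWithReduce(words));   // abc
        System.out.println(concatWithCollect(words));  // abc
    }
}
```

Both methods return the same string; the difference is that the reduce version copies the growing string at every step, while the collect version appends to a builder.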

The anatomy of a Collector

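A Collector<T, A, R> has three type parameters and five methods: four functions plus a set of characteristics. T is the input element, A is the mutable accumulation type, R is the final result. The four functions that do the work are the supplier (make an empty container), the accumulator (fold one element in), the combiner (merge two partial containers), and the finisher (turn the container into the result). The fifth method, characteristics(), returns flags such as IDENTITY_FINISH, CONCURRENT, and UNORDERED that tell the stream which optimizations are safe.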

// toList, expressed as its four parts:
//   supplier    = ArrayList::new   -> a fresh empty list
//   accumulator = List::add        -> add one element
//   combiner    = (a, b) -> { a.addAll(b); return a; }  // merge two lists
//   finisher    = identity         -> the list IS the result (IDENTITY_FINISH)

The combiner is the piece that makes a collector safe for parallel streams: split the work, fill several containers, then merge them. The finisher is skipped entirely when the container already is the result; that optimization is flagged by the IDENTITY_FINISH characteristic. Understanding these parts (four functions plus the characteristics flags) is what lets you read, compose, and eventually write collectors.
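One way to see those parts in action is to drive a collector by hand, roughly the way a parallel stream would: fill two containers, merge them, then finish. The helper runByHand below is a hypothetical name for this sketch, not a library method.

```java
import java.util.List;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class CollectorAnatomyDemo {
    // Simulates a two-way split: accumulate each half separately, combine, finish.
    static <T, A, R> R runByHand(Collector<T, A, R> collector,
                                 List<? extends T> left,
                                 List<? extends T> right) {
        A first = collector.supplier().get();            // empty container #1
        for (T t : left) collector.accumulator().accept(first, t);

        A second = collector.supplier().get();           // empty container #2
        for (T t : right) collector.accumulator().accept(second, t);

        A merged = collector.combiner().apply(first, second);  // merge the partial results
        return collector.finisher().apply(merged);              // container -> final result
    }

    public static void main(String[] args) {
        String joined = runByHand(Collectors.joining("-"), List.of("a", "b"), List.of("c"));
        System.out.println(joined);  // a-b-c

        // The flags a collector advertises can be inspected directly.
        System.out.println(Collectors.toList().characteristics());
    }
}
```

A real stream decides for itself how to split the input and whether the finisher can be skipped; the sketch only shows the order in which the four functions cooperate.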

toList, toSet, and toCollection

The simplest collectors gather elements into a collection, differing only in the container they fill. toList() accumulates into a List (an ArrayList in practice). toSet() accumulates into a HashSet, dropping duplicates and abandoning order. toCollection(supplier) accumulates into whatever collection you hand it, for when the default won't do.

List<String>    list   = s.collect(Collectors.toList());
Set<String>     set    = s.collect(Collectors.toSet());          // dedup, no order
TreeSet<String> sorted = s.collect(Collectors.toCollection(TreeSet::new)); // sorted

One distinction interviewers love: Collectors.toList() returns a modifiable ArrayList in practice (the specification guarantees neither the type nor the mutability), while the newer stream().toList() (Java 16+) returns an unmodifiable list. Prefer stream().toList() for read-only results; keep Collectors.toList() for when you genuinely need to mutate afterward, or use toCollection(ArrayList::new) to make that intent explicit. Java 10's toUnmodifiableList/Set/Map give unmodifiable variants which, unlike stream().toList(), also reject null elements.
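The mutability difference is easy to observe. The sketch below assumes Java 16 or later (for stream().toList()); rejectsAdd is a hypothetical helper written for this example.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ListMutabilityDemo {
    // Returns true if the list refuses add() with UnsupportedOperationException.
    static boolean rejectsAdd(List<String> list) {
        try {
            list.add("x");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<String> viaCollector = Stream.of("a", "b").collect(Collectors.toList());
        List<String> viaToList = Stream.of("a", "b").toList();  // Java 16+
        List<String> viaUnmodifiable =
                Stream.of("a", "b").collect(Collectors.toUnmodifiableList());  // Java 10+

        System.out.println(rejectsAdd(viaCollector));     // false on current OpenJDK builds
        System.out.println(rejectsAdd(viaToList));        // true
        System.out.println(rejectsAdd(viaUnmodifiable));  // true
    }
}
```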

toMap and the merge function

toMap builds a Map from each element using a key mapper and a value mapper. The default result is a HashMap. The trap, and the most common toMap bug, is duplicate keys: if two elements produce the same key, the two-argument toMap throws IllegalStateException.

// Three-arg form: a merge function resolves key collisions
Map<String, Integer> totals = orders.stream()
    .collect(Collectors.toMap(
        Order::customer,     // key mapper
        Order::amount,       // value mapper
        Integer::sum));      // merge: (existing, new) -> existing + new

// a fourth argument supplies the map type, e.g. TreeMap::new for ordering

The merge function (existing, new) -> result runs whenever two elements collide on a key. Rule of thumb: unless the keys are provably unique, always supply a merge function — it is cheaper to write than to debug the IllegalStateException in production.
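A small runnable sketch of both behaviours follows. The Order record here is a hypothetical stand-in for whatever order type the snippet above assumes, and it needs Java 16+ for record syntax.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class ToMapDemo {
    // Hypothetical order type, for illustration only.
    record Order(String customer, int amount) {}

    // Two-arg toMap: throws IllegalStateException if a customer appears twice.
    static Map<String, Integer> strict(List<Order> orders) {
        return orders.stream()
                .collect(Collectors.toMap(Order::customer, Order::amount));
    }

    // Four-arg toMap: sum colliding values and collect into a sorted TreeMap.
    static Map<String, Integer> totals(List<Order> orders) {
        return orders.stream()
                .collect(Collectors.toMap(
                        Order::customer,   // key mapper
                        Order::amount,     // value mapper
                        Integer::sum,      // merge function for duplicate keys
                        TreeMap::new));    // map factory
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("bob", 5), new Order("ann", 10), new Order("bob", 7));

        System.out.println(totals(orders));  // {ann=10, bob=12}

        try {
            strict(orders);
        } catch (IllegalStateException e) {
            System.out.println("duplicate key rejected");
        }
    }
}
```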

groupingBy: the GROUP BY of streams

groupingBy takes a classifier function and partitions elements into a Map<K, List<T>>, where every element sharing a classifier result lands in the same list. The single-argument form gives lists; the two-argument form replaces the list with a downstream collector applied to each group — and that is where the power lives.

// single-arg: Map<Dept, List<Employee>>
Map<Dept, List<Employee>> byDept = emps.stream()
    .collect(Collectors.groupingBy(Employee::dept));

// two-arg: a downstream reshapes each bucket (here, count per dept)
Map<Dept, Long> counts = emps.stream()
    .collect(Collectors.groupingBy(Employee::dept, Collectors.counting()));

Common downstreams are counting, summingInt/Long/Double, averagingInt/Long/Double, mapping, toSet, joining, maxBy/minBy, and reducing. Because the downstream is itself a collector, you can keep composing, which is the whole game.
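As a sketch of that composition, the example below swaps in two other downstreams. The Employee record is hypothetical, and it uses a plain String for the department where the snippets above use a Dept type.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class GroupingDownstreamDemo {
    // Hypothetical employee type; dept is a String here purely for brevity.
    record Employee(String name, String dept, int salary) {}

    // Total salary per department: summingInt as the downstream.
    static Map<String, Integer> payrollByDept(List<Employee> emps) {
        return emps.stream().collect(Collectors.groupingBy(
                Employee::dept, Collectors.summingInt(Employee::salary)));
    }

    // Comma-separated names per department: mapping feeds joining.
    static Map<String, String> namesByDept(List<Employee> emps) {
        return emps.stream().collect(Collectors.groupingBy(
                Employee::dept,
                Collectors.mapping(Employee::name, Collectors.joining(", "))));
    }

    public static void main(String[] args) {
        List<Employee> emps = List.of(
                new Employee("ann", "eng", 100),
                new Employee("bob", "eng", 120),
                new Employee("cy", "ops", 90));

        System.out.println(payrollByDept(emps).get("eng"));  // 220
        System.out.println(namesByDept(emps).get("eng"));    // ann, bob
    }
}
```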

Reshaping groups: mapping, filtering, nesting

Three patterns turn groupingBy from "bucket of objects" into exactly the shape you want. mapping(mapper, downstream) transforms each element before it reaches the downstream, like a map() embedded inside the collector. filtering(predicate, downstream) (Java 9) keeps only matching members but still creates a key for every group, even one left empty; an upstream filter() removes elements before they are classified, so a group with no survivors never appears in the map. And because a downstream can be another groupingBy, you nest them for multi-level maps.

// names per dept (mapping projects a field before collecting to a list)
Map<Dept, List<String>> names = emps.stream()
    .collect(Collectors.groupingBy(Employee::dept,
             Collectors.mapping(Employee::name, Collectors.toList())));

// two-level: by dept, then by city — like GROUP BY dept, city
Map<Dept, Map<String, List<Employee>>> nested = emps.stream()
    .collect(Collectors.groupingBy(Employee::dept,
             Collectors.groupingBy(Employee::city)));

Use filtering over a pre-filter whenever every group key must appear even with no surviving members; use mapping to collect a field rather than the whole object; nest to mirror a multi-column GROUP BY.
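The difference between an upstream filter() and a downstream filtering() is easiest to see side by side. This is a minimal sketch with a hypothetical Employee record (String department, Java 16+ for records, Java 9+ for filtering).

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class FilteringDemo {
    // Hypothetical employee type, for illustration only.
    record Employee(String name, String dept, int salary) {}

    // Upstream filter(): a department with no qualifying members never reaches groupingBy.
    static Map<String, List<String>> preFiltered(List<Employee> emps, int minSalary) {
        return emps.stream()
                .filter(e -> e.salary() >= minSalary)
                .collect(Collectors.groupingBy(Employee::dept,
                         Collectors.mapping(Employee::name, Collectors.toList())));
    }

    // Downstream filtering(): every department gets a key, possibly with an empty list.
    static Map<String, List<String>> filteredInGroup(List<Employee> emps, int minSalary) {
        return emps.stream()
                .collect(Collectors.groupingBy(Employee::dept,
                         Collectors.filtering(e -> e.salary() >= minSalary,
                             Collectors.mapping(Employee::name, Collectors.toList()))));
    }

    public static void main(String[] args) {
        List<Employee> emps = List.of(
                new Employee("ann", "eng", 100),
                new Employee("bob", "eng", 120),
                new Employee("cy", "ops", 90));

        System.out.println(preFiltered(emps, 100).containsKey("ops"));  // false
        System.out.println(filteredInGroup(emps, 100).get("ops"));      // []
        System.out.println(filteredInGroup(emps, 100).get("eng"));      // [ann, bob]
    }
}
```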

partitioningBy: the two-bucket special case

partitioningBy is a specialized groupingBy for a predicate: it splits the stream into exactly two groups, returning a Map<Boolean, List<T>> keyed by true and false. Its defining difference is that the map always contains both keys, even when one partition is empty — whereas groupingBy simply omits empty groups.

Map<Boolean, List<Integer>> parts = nums.stream()
    .collect(Collectors.partitioningBy(n -> n % 2 == 0));
parts.get(true);   // evens
parts.get(false);  // odds  -- present even if empty

// it accepts a downstream too:
Map<Boolean, Long> howMany =
    nums.stream().collect(Collectors.partitioningBy(n -> n > 0, Collectors.counting()));

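Reach for partitioningBy whenever the grouping key is boolean and you want both buckets guaranteed present. It can also be marginally cheaper than the equivalent groupingBy, since the result only ever holds two entries, though the difference rarely matters in practice.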
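The "both keys always present" guarantee shows up as soon as one side is empty. In the sketch below (hypothetical class and method names), the same predicate is used as a partitioningBy predicate and as a groupingBy classifier on a list containing only odd numbers.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PartitioningDemo {
    // Always yields two entries: true and false.
    static Map<Boolean, List<Integer>> partitionEvens(List<Integer> nums) {
        return nums.stream().collect(Collectors.partitioningBy(n -> n % 2 == 0));
    }

    // Yields an entry only for classifier results that actually occurred.
    static Map<Boolean, List<Integer>> groupEvens(List<Integer> nums) {
        return nums.stream().collect(Collectors.groupingBy(n -> n % 2 == 0));
    }

    public static void main(String[] args) {
        List<Integer> oddsOnly = List.of(1, 3, 5);

        System.out.println(partitionEvens(oddsOnly).get(true));       // []
        System.out.println(groupEvens(oddsOnly).get(true));           // null
        System.out.println(groupEvens(oddsOnly).containsKey(true));   // false
    }
}
```

With groupingBy, code that reads get(true) has to handle null; with partitioningBy it can iterate the empty list directly.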

joining and the statistics collectors

joining concatenates CharSequence elements into one String, using a StringBuilder or StringJoiner internally so it avoids the repeated copying of reducing with +. The numeric collectors (summingInt/Long/Double, averagingInt/Long/Double, and summarizingInt/Long/Double) extract a value and aggregate it. The summarizing ones are the standout: they compute count, sum, min, max, and average in a single pass.

String csv = names.stream()
    .collect(Collectors.joining(", ", "[", "]"));   // "[Ann, Bob, Cy]"

IntSummaryStatistics stats = emps.stream()
    .collect(Collectors.summarizingInt(Employee::salary));
stats.getCount();   stats.getSum();   stats.getMax();   stats.getAverage();

Note that every averaging* variant returns a Double (a mean is rarely integral). When you need several statistics at once, prefer summarizing* over running three separate collectors — it traverses the stream only once. All of these shine as groupingBy downstreams for per-group totals and stats.
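Used as a groupingBy downstream, summarizingInt gives all five statistics per group in one traversal. The sketch below uses a hypothetical Employee record with a String department.

```java
import java.util.IntSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PerGroupStatsDemo {
    // Hypothetical employee type, for illustration only.
    record Employee(String name, String dept, int salary) {}

    // One pass over the stream, one IntSummaryStatistics per department.
    static Map<String, IntSummaryStatistics> salaryStatsByDept(List<Employee> emps) {
        return emps.stream().collect(Collectors.groupingBy(
                Employee::dept, Collectors.summarizingInt(Employee::salary)));
    }

    public static void main(String[] args) {
        List<Employee> emps = List.of(
                new Employee("ann", "eng", 100),
                new Employee("bob", "eng", 120),
                new Employee("cy", "ops", 90));

        IntSummaryStatistics eng = salaryStatsByDept(emps).get("eng");
        System.out.println(eng.getCount());    // 2
        System.out.println(eng.getSum());      // 220
        System.out.println(eng.getMin());      // 100
        System.out.println(eng.getMax());      // 120
        System.out.println(eng.getAverage());  // 110.0
    }
}
```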

collectingAndThen and custom collectors

collectingAndThen(downstream, finisher) runs a collector and then applies a finishing transformation to its result — bolting a custom finisher onto an existing collector without writing one. Its two classic uses are wrapping a collection as unmodifiable and unwrapping the Optional that maxBy/minBy produce as a groupingBy downstream.

// highest-paid employee per dept, with the Optional unwrapped
Map<Dept, Employee> top = emps.stream()
    .collect(Collectors.groupingBy(Employee::dept,
             Collectors.collectingAndThen(
                 Collectors.maxBy(Comparator.comparingInt(Employee::salary)),
                 Optional::get)));
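The other classic use, freezing the collected result, might look like the following sketch. The class and method names are hypothetical.

```java
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class CollectingAndThenDemo {
    // Collect into a list, then wrap it so callers cannot modify it.
    static List<String> frozenUpper(List<String> names) {
        return names.stream()
                .map(String::toUpperCase)
                .collect(Collectors.collectingAndThen(
                        Collectors.toList(),              // the underlying collector
                        Collections::unmodifiableList));  // finisher applied to its result
    }

    public static void main(String[] args) {
        List<String> frozen = frozenUpper(List.of("ann", "bob"));
        System.out.println(frozen);  // [ANN, BOB]

        try {
            frozen.add("CY");
        } catch (UnsupportedOperationException e) {
            System.out.println("result is read-only");
        }
    }
}
```

On Java 10+ the built-in toUnmodifiableList() covers this particular case; the pattern is still useful for any other post-processing step.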

When nothing composes, build one directly with Collector.of(supplier, accumulator, combiner, [finisher], characteristics...) — the same four functions from the anatomy section, passed by hand. The combiner is mandatory so the collector works in parallel.

Collector<String, StringJoiner, String> upperCsv = Collector.of(
    () -> new StringJoiner(", "),       // supplier
    (j, s) -> j.add(s.toUpperCase()),   // accumulator
    StringJoiner::merge,                // combiner (parallel-safe)
    StringJoiner::toString);            // finisher
String result = names.stream().collect(upperCsv);
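As a quick check that the combiner does its job, the collector above can be wrapped in a small class and run on both a sequential and a parallel stream. The wrapper class and method names are hypothetical; the collector itself is the one shown above.

```java
import java.util.List;
import java.util.StringJoiner;
import java.util.stream.Collector;

class CustomCollectorDemo {
    // Same four functions as the snippet above, returned from a factory method.
    static Collector<String, StringJoiner, String> upperCsv() {
        return Collector.of(
                () -> new StringJoiner(", "),       // supplier
                (j, s) -> j.add(s.toUpperCase()),   // accumulator
                StringJoiner::merge,                // combiner: used when the work is split
                StringJoiner::toString);            // finisher
    }

    public static void main(String[] args) {
        List<String> names = List.of("ann", "bob", "cy");

        String sequential = names.stream().collect(upperCsv());
        String parallel = names.parallelStream().collect(upperCsv());

        System.out.println(sequential);                   // ANN, BOB, CY
        System.out.println(sequential.equals(parallel));  // true
    }
}
```

Because no UNORDERED characteristic is passed, the parallel run preserves the list's encounter order and produces the same string.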

Rule of thumb: compose existing collectors with mapping, collectingAndThen, and teeing (Java 12, which feeds two collectors and merges their results) first; write a fully custom Collector.of only when no combination fits.
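For a sense of what teeing buys you before reaching for Collector.of, here is a minimal sketch (Java 12+) that computes a mean from a sum collector and a count collector in one pass. The class and method names are hypothetical, and returning 0.0 for an empty list is an arbitrary choice made for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

class TeeingDemo {
    // Every element is sent to both collectors; the third argument merges their results.
    static double mean(List<Integer> nums) {
        return nums.stream().collect(Collectors.teeing(
                Collectors.summingInt(Integer::intValue),   // first result: Integer sum
                Collectors.counting(),                      // second result: Long count
                (sum, count) -> count == 0L ? 0.0 : sum.doubleValue() / count));
    }

    public static void main(String[] args) {
        System.out.println(mean(List.of(1, 2, 3, 4)));  // 2.5
        System.out.println(mean(List.of()));            // 0.0
    }
}
```

averagingInt would do this particular job directly; the point of the sketch is only that two independent collectors can share a single traversal without a hand-written accumulator type.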

Recap

collect() is a mutable reduction driven by a Collector, whose four functions — supplier, accumulator, combiner, finisher — explain everything else (the combiner makes it parallel-safe, the finisher is skipped under IDENTITY_FINISH). Gather with toList/toSet/toCollection, build maps with toMap and always supply a merge function when keys can collide, and group with groupingBy — single, with a downstream, or nested — reshaping buckets via mapping and filtering. Use partitioningBy for the boolean two-bucket case (both keys always present), joining for strings, and the summing/averaging/summarizing collectors for per-group statistics. Finally, collectingAndThen adapts a collector's output and Collector.of lets you write one from scratch — but the real skill is seeing collectors as composable blocks and reaching for the factory before the hand-rolled solution.
