Some Preaching
Premature optimization is the root of all evil. (Donald Knuth)
What Knuth meant was that spending time optimizing the parts of a program that are rarely executed has limited impact on overall performance, yet can waste substantial developer effort, particularly if attempted before the critical parts of the program have been identified.
Premature generalization is a related but opposite evil.
It is characterized by the premature use of generalizing elements before a concrete business need has been defined.
Examples might include overly broad public surface area; unnecessary degrees of freedom, overloads or parameters; unneeded extensibility points; or gratuitous polymorphism.
Premature optimization is an evil because it wastes developer time while producing little valuable output.
Premature generalization is an evil because it produces a more complicated, less performant, more difficult to test, and more difficult to debug product.
Oh, and it wastes developer time while producing little valuable output.
One of the main difficulties with premature generalization is that its costs to the product are VERY hard to measure.
Profiling can help us find the hotspots in code and focus our optimization efforts on the critical parts.
Premature generalization is often referred to as a “peanut butter” effect because it is not localized to one area but instead is evenly spread over the entire product in a way that doesn’t often show up in a profile.
It is a death by a thousand cuts. Because it is hard to measure, it is hard to detect and remove later.
The best strategy is to avoid it from the beginning through the use of style guidelines, best practices, surface area minimization, and strong code review. Tools such as static analysis can further assist with these.
One of the most common examples of premature generalization is the overuse of interfaces (in the programming language syntax sense, not the design sense). For example:
• The definition of IFoo just so the concrete type Foo can implement it (a minimal sketch of this pattern follows below).
• The use of IList<T> where a concrete List<T> or an array would do.
• The use of Stream when every known use case will actually create or expect a fully materialized contiguous memory buffer, which could have been much more cheaply allocated as Memory<byte>.
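To make that first pattern concrete, here is a minimal sketch (the names are made up, not from any real codebase): an interface with exactly one implementation and no second consumer in sight adds dispatch cost and public surface area without buying any flexibility.

using System.Collections.Generic;

public sealed record Widget(string Id, string Name);

// Hypothetical example: IWidgetStore exists only so WidgetStore can implement it.
public interface IWidgetStore
{
    Widget? Find(string id);
}

// The one and only implementation. Until a second implementation (or a genuine
// test seam) appears, the interface is pure overhead: more surface area, and
// every call through IWidgetStore is an interface dispatch.
public sealed class WidgetStore : IWidgetStore
{
    private readonly Dictionary<string, Widget> _widgets = new();

    public Widget? Find(string id) =>
        _widgets.TryGetValue(id, out var widget) ? widget : null;
}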
While there are certainly legitimate and powerful reasons to use an interface, their overuse adds a peanut butter cost.
Some Theory
How big is that cost? In an attempt to develop an intuition for this, I wrote a few micro-benchmarks (my favorite pastime). In the first benchmark I asked the question: what is the relative cost of:
• Direct Call: calling an instance method on a sealed concrete type (virtual or not).
• Abstract Call: calling a virtual method through an abstract type.
• Interface Call: calling an interface method through an interface.
To understand why these might be different it will be helpful to take a slight detour into thinking about (a simplified version of) how method calls are made in C#.
An object reference is a single pointer (machine word sized) that points to a memory allocation on the heap that is the object state. The memory allocation has an object header followed by the actual object state (i.e. fields).
The object header contains a few things but primarily it contains a pointer to another piece of memory holding the type metadata which includes the vtable and interface dispatch slots for the actual type of the original object.
All objects of the same type share the same type metadata.
To make a Direct Call, the runtime needs to know both the address of the object (which is the object reference itself) and the address of the method’s implementation to jump to.
For a sealed concrete type, the specific implementation that ends up being used at runtime is known statically at compile-time (or JIT time). The method’s address can be included directly at the call site embedded in the jump instruction.
When making an Abstract Call through an abstract type, finding the address of the method is not so easy.
The method could be overridden by any type derived from the abstract type.
The actual version of the method can only be known at runtime by examining the actual type of the object rather than its static type.
To do this the compiler emits an indirect read to first follow the object pointer to the object header, follow the type metadata pointer to the type metadata and index the vtable which includes an ordered list of all the virtual method addresses.
Since the compiler knows statically the index of the slot in the vtable, all of these indirections are “constant time” and only add a few extra instructions per call.
When making an Interface Call through an interface type, finding the address is still more complicated.
Because many different actual types may implement the same interface, and each of those types have different vtable layouts, the compiler can’t know statically the index of the slot in the vtable for a given interface method’s implementation.
There are some clever optimizations to solve this problem (read the real story here), but a simplified view is that the runtime uses a dispatch map, keyed on the actual type of the object, to find the vtable index where the interface’s methods begin, and then proceeds the same way it would for a virtual call. This adds a few more instructions per call.
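To build intuition only (this is emphatically not how the runtime actually implements it), here is a toy model of the two dispatch shapes: a virtual call is a single array read at a statically known slot, while an interface call first needs a per-type lookup to find where that interface’s slots start.

using System;
using System.Collections.Generic;

public static class DispatchToyModel
{
    // "Virtual call": the slot index is known at compile time, so dispatch is
    // one indexed read into the type's vtable, then an indirect call.
    public static int VirtualCall(Func<int, int>[] vtable, int slot, int arg) =>
        vtable[slot](arg);

    // "Interface call": vtable layout differs per concrete type, so we first map
    // the actual type to the base slot where the interface's methods live, and
    // only then do the same indexed read and indirect call.
    public static int InterfaceCall(
        Dictionary<Type, int> dispatchMap, Func<int, int>[] vtable,
        Type actualType, int interfaceSlot, int arg)
    {
        int baseSlot = dispatchMap[actualType]; // the extra work per interface call
        return vtable[baseSlot + interfaceSlot](arg);
    }
}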
Now the compiler (or JIT) tries to be super smart, so sometimes it can convert an abstract call to a direct call.
This is a process called de-virtualization. Additionally, direct calls can further be inlined where the code from the called method is copied directly into the caller and unused paths are eliminated.
Inlining has a huge impact on performance as it can eliminate the overhead of a direct call entirely. Abstract calls (unless de-virtualized) and interface calls cannot be inlined because the target can’t be known until runtime (unless you have a path based JIT - which C# doesn’t - yet).
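Here is a sketch of the kind of call site current JITs can usually handle (exact behavior varies by runtime version, so treat the comments as tendencies, not guarantees):

public abstract class Shape
{
    public abstract double Area();
}

public sealed class Square : Shape
{
    public double Side;
    public override double Area() => Side * Side;
}

public static class Devirtualization
{
    public static double Sample()
    {
        // The JIT can see that s is always a Square here, so the virtual call can
        // typically be de-virtualized to a direct call and, because Square is
        // sealed and Area is tiny, inlined down to roughly `Side * Side`.
        Shape s = new Square { Side = 3 };
        return s.Area();
    }

    public static double NotKnowable(Shape s) =>
        // Here the actual type is unknown at JIT time, so this generally stays a
        // true virtual call through the vtable.
        s.Area();
}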
Some Numbers
So, what does this all add up to? Let’s show some numbers! Consider this abstract type:
private abstract class AbstractType
{
    public abstract int Get(int i);
    public abstract int GetIndirect(AbstractType obj, int i);
}
And this micro-benchmark:
long total = 0;
foreach (int i in Data)
{
    total += obj.[Get|GetIndirect]([obj,] i);
}
Where Data is some set of integers (say 1..100K) and we call either the Get method or the GetIndirect method. GetIndirect tests what it costs to pass an abstraction as a capability into a method, and so potentially involves TWO dispatched calls: the outer call, then an inner call back through the argument.
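For concreteness, here is roughly how the Concrete/Abstract/Interface variants might be wired up with BenchmarkDotNet. This is my reconstruction of the setup, not the original benchmark code; in particular the IGetter interface and the concrete implementation are my own stand-ins.

using System.Linq;
using BenchmarkDotNet.Attributes;

// The article's AbstractType, redeclared here (non-nested) so the sketch compiles on its own.
public abstract class AbstractType
{
    public abstract int Get(int i);
    public abstract int GetIndirect(AbstractType obj, int i);
}

public interface IGetter { int Get(int i); }

public sealed class ConcreteType : AbstractType, IGetter
{
    public override int Get(int i) => i;
    public override int GetIndirect(AbstractType obj, int i) => obj.Get(i);
}

public class CallCostBenchmarks
{
    private readonly int[] Data = Enumerable.Range(0, 100_000).ToArray();
    private readonly ConcreteType _concrete = new();
    private readonly AbstractType _abstract;
    private readonly IGetter _interface;

    public CallCostBenchmarks() { _abstract = _concrete; _interface = _concrete; }

    [Benchmark]
    public long Concrete() { long t = 0; foreach (int i in Data) t += _concrete.Get(i); return t; }

    [Benchmark]
    public long Abstract() { long t = 0; foreach (int i in Data) t += _abstract.Get(i); return t; }

    [Benchmark]
    public long Interface() { long t = 0; foreach (int i in Data) t += _interface.Get(i); return t; }

    [Benchmark]
    public long ConcreteIndirect() { long t = 0; foreach (int i in Data) t += _concrete.GetIndirect(_concrete, i); return t; }
}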
Method | Mean | Error | StdDev | Median | Rank |
---|---|---|---|---|---|
Concrete | 88.56 μs | 0.169 μs | 0.247 μs | 88.46 μs | 1 |
ConcreteIndirect | 108.95 μs | 0.239 μs | 0.343 μs | 108.80 μs | 2 |
Abstract | 218.44 μs | 0.898 μs | 1.317 μs | 217.84 μs | 3 |
AbstractIndirect | 236.12 μs | 6.789 μs | 9.517 μs | 228.20 μs | 4 |
Interface | 272.45 μs | 0.745 μs | 1.045 μs | 272.11 μs | 5 |
InterfaceIndirect | 423.38 μs | 10.499 μs | 15.390 μs | 410.26 μs | 6 |
We see a clear stratification in the results where concrete direct calls (potentially with inlining) outperform abstract calls, while abstract calls in turn outperform interface calls. This lines up with our intuition from above.
Direct Calls, Inlining, and the BCL
This is all well and good… but the costs here are tiny. Peanut butter. What is the impact of this on real abstractions? To give some intuition, let’s look at your and my favorite collection type: lists. Let’s change our micro-benchmark to be:
long total = 0;
foreach (int i in List)
{
    total += List[i];
}
Where List is a linear collection of integers. This benchmark looks at several costs including the call to the Get(index) method (aka the indexer), and the cost of calling GetEnumerator and enumerating the result.
Method | Mean | Error | StdDev | Median | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|
ReadOnlyMemory | 67.11 μs | 0.255 μs | 0.358 μs | 66.94 μs | 1 | - | - | - | 1 B |
Array | 70.55 μs | 0.411 μs | 0.602 μs | 70.43 μs | 2 | - | - | - | 2 B |
List | 326.80 μs | 0.801 μs | 1.123 μs | 326.32 μs | 3 | - | - | - | 4 B |
IList | 910.19 μs | 28.101 μs | 39.393 μs | 941.92 μs | 4 | - | - | - | 40 B |
ReadOnlyCollection | 989.41 μs | 10.704 μs | 15.006 μs | 981.92 μs | 5 | - | - | - | 51 B |
ImmutableList | 10,830.06 μs | 118.628 μs | 166.299 μs | 10,895.91 μs | 6 | - | - | - | - |
Again we see that concrete types outperform virtual types, which in turn outperform interface types.
Note here that the List, IList, and ReadOnlyCollection (which wraps an IList) are actually exactly the same object.
That object is only being accessed in a different way, as sketched below. (ImmutableList was thrown in here to demonstrate that you should be careful about making assumptions: it is actually implemented as a balanced binary tree, not a contiguous collection at all… and the perf shows it.)
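To show what “the same object, accessed differently” means in practice, here is a sketch of how such benchmark fields could be set up (my reconstruction, not the original code):

using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Linq;

var data = Enumerable.Range(0, 100_000).ToList();

// All three views below point at the same underlying List<int>.
List<int> asList = data;                             // concrete: indexer/enumerator can be devirtualized and inlined
IList<int> asIList = data;                           // same object, but every access is now an interface dispatch
ReadOnlyCollection<int> wrapped = data.AsReadOnly(); // thin wrapper forwarding to the same list through IList<int>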
Another thing to note here is the “Allocated” column on these results.
While this value is a little noisy, it should be clear that getting an enumerator from an interface allocates while getting one from a concrete or abstract type does not. The reason for this should be clear.
An interface’s GetEnumerator method MUST return an IEnumerator interface.
A concrete or abstract type can return a concrete enumerator type (and many in the BCL do) which often is a struct (allocation free).
Gratuitous short-lived heap allocations also increase Gen0 memory pressure which again adds more peanut butter cost in garbage collection.
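The difference is easy to see directly: enumerating the concrete List<int> uses its struct enumerator, while handing the same list out as an interface forces that struct to be boxed.

using System.Collections.Generic;

var list = new List<int> { 1, 2, 3 };

// Concrete type: List<int>.GetEnumerator() returns the public struct
// List<int>.Enumerator, so foreach over a List<int> allocates nothing.
List<int>.Enumerator structEnumerator = list.GetEnumerator();

// Interface: IEnumerable<int>.GetEnumerator() must return an IEnumerator<int>,
// so the struct enumerator is boxed onto the heap, one allocation per foreach.
IEnumerable<int> asInterface = list;
IEnumerator<int> boxedEnumerator = asInterface.GetEnumerator();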
Lastly, let’s apply the same micro-benchmark to the second most used collection type: maps. Our micro-benchmark is now:
long total = 0;
foreach (int i in Dict.Keys)
{
    if (Dict.TryGetValue(i, out int value))
    {
        total += value;
    }
}
And we are looking at the same costs as above: performing a lookup and enumerating (in this case the keys).
Method | Mean | Error | StdDev | Rank | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|
Dictionary | 1.152 ms | 0.0056 ms | 0.0081 ms | 1 | - | - | - | 18 B |
IReadOnlyDictionary | 1.539 ms | 0.0172 ms | 0.0247 ms | 2 | - | - | - | 61 B |
IDictionary | 1.622 ms | 0.0036 ms | 0.0048 ms | 3 | - | - | - | 58 B |
ConcurrentDictionary | 2.422 ms | 0.0371 ms | 0.0544 ms | 4 | 15.6250 | 15.6250 | 15.6250 | 400190 B |
The results follow the same pattern with concrete/abstract types being faster and interface enumeration allocating.
The last interesting thing to note here is that ConcurrentDictionary is expensive.
Though it is made to look like a regular Dictionary for most programming purposes, to achieve lock-free concurrency it is actually a much more complicated implementation. Note it also allocates.
Not just during enumeration (as IDictionary does) but also during lookup! This is because there are per-thread components to the data structure. As the micro-benchmark executes across multiple different threads it keeps allocating internal data structures to manage those per-thread components.
As threads come and go in the thread pool, those per-thread components are allocated and released.
Don’t get me wrong, ConcurrentDictionary can still be more performant than using a mutex/lock around a Dictionary, but if you can achieve memory isolation (e.g. immutable snapshots, sketched below) instead, the performance can be much improved. Don’t use a ConcurrentDictionary where a plain old Dictionary would also work.
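As a rough illustration of the snapshot idea (not the article’s code, and deliberately simplified): readers do plain Dictionary lookups against a snapshot that is never mutated after publication, and writers publish a whole new dictionary atomically. This only pays off when reads vastly outnumber writes.

using System.Collections.Generic;
using System.Threading;

public sealed class SnapshotMap<TKey, TValue> where TKey : notnull
{
    private Dictionary<TKey, TValue> _snapshot = new();

    // Readers see a dictionary that is never mutated after publication, so this
    // is an ordinary, allocation-free, lock-free Dictionary lookup.
    public bool TryGetValue(TKey key, out TValue value) =>
        Volatile.Read(ref _snapshot).TryGetValue(key, out value!);

    // Writers copy the current snapshot, mutate the copy, and publish it with a
    // compare-and-swap. Expensive per write; only sensible when writes are rare.
    public void Set(TKey key, TValue value)
    {
        while (true)
        {
            var current = Volatile.Read(ref _snapshot);
            var next = new Dictionary<TKey, TValue>(current) { [key] = value };
            if (Interlocked.CompareExchange(ref _snapshot, next, current) == current)
                return;
        }
    }
}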
Some Conclusions
My simple takeaway from all of this is a rule of thumb (but not an absolute): Use concrete types where possible. Use abstract types when you need to. Use interfaces only when necessary.