Identity and Equality: Syntax and Semantics

Courtesy of C/C++ Users Journal (October 2003)

Matthew Wilson

The use of C-based syntax in modern C-family and related languages affords a significant advantage to software developers when transferring their skills from one language to others. Unfortunately, this very utility masks some serious pitfalls from the unwary; these pitfalls form the focus of this article.

Object-oriented languages necessarily provide facilities for creating and manipulating objects. Two important aspects of the characteristics of objects are identity and equality. Object identity pertains to the uniqueness of an object. If an object reference is identity compared to another object reference, the result will be true only if the two object references refer to the same object.

Object equality pertains to the value of an object. If an object reference is equality compared to another object reference, the result will be true if the referenced objects are logically equal. This equality can be as simple as the two objects having the same internal values, or may be determined via an equality method (whether instance/non-static or class/static) and therefore as complex/convoluted as the author of the class deems appropriate.
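As a concrete illustration of the two notions, here is a minimal sketch in Python (one of the languages discussed below) using built-in lists. The variable names are purely illustrative and are not taken from any listing in this article.

```python
# Two distinct list objects that happen to hold the same values.
first = [1, 2, 3]
second = [1, 2, 3]
alias = first                      # a second reference to the same object

same_object = first is alias       # identity: one object, two references
equal_values = first == second     # equality: same contents
distinct_objects = first is not second

# A change made through one reference is visible through the other,
# because both refer to one object; the merely equal object is unaffected.
first.append(4)
alias_tracks_change = (alias == [1, 2, 3, 4])
still_equal = (first == second)

print(same_object, equal_values, distinct_objects)
print(alias_tracks_change, still_equal)
```

The point of the mutation at the end is that identity has observable consequences beyond the comparison itself: two references that are identical share state, whereas two objects that are merely equal do not.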

Syntax

There is a significant variation in the syntax used for testing identity and equality across the C-family and related languages, as shown in Table 1.

In C++, testing of identity is done by comparing the addresses of the instances or references concerned with the == operator, and testing of equality by use of the == operator on the references or instances themselves, as in

// C++
class X{};

X  x1 = . . .; // An instance
X  &x2 = . . .;  // A reference

if(&x1 == &x2) // Identity
{
  printf("x1 and x2 refer to the same object\n");
}
if(x1 == x2) // Equality
{
  printf("x1 and x2 are equal\n");
}

In C#, D, Java, J#, Python, and VB.NET, you cannot directly access object instances. They are only manipulable indirectly via references (except for value types in C#, that is).

Java and the .NET language J# perform identity tests by applying the == operator to the references, and perform equality comparison by calling the equals() (Java) or Equals() (J#) method, as in

// Java & J#
class X{}

X  x1 = . . .; // A reference
X  x2 = . . .; // A reference

if(x1 == x2) // Identity
{
  System.out.println("x1 and x2 refer to the same object");
}
if(x1.equals(x2)) // Equality (Java); in J#, x1.Equals(x2)
{
  System.out.println("x1 and x2 are equal");
}

The implementation of equals() (Java), Equals() (C#, J#, VB.NET), or eq() (D) in the root-most Object class simply conducts an identity test: it compares only the values of the references themselves (i.e., the addresses of their referenced instances), not the values of the referenced instances. Hence, equality comparison is only available in these languages to classes that explicitly provide it by overriding these methods, or by inheriting from classes that do so. If you create a type whose value can be meaningfully compared (such as a String, a Person, or an XmlNode), you will want to override the equality method.
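Python behaves analogously: a class that defines no __eq__ inherits a comparison that tests identity only. The following is a minimal sketch rather than a recommended design. PlainPoint and ValuePoint are hypothetical classes invented for the illustration.

```python
class PlainPoint:
    """Hypothetical class that inherits the root object's comparison."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class ValuePoint:
    """Hypothetical class that overrides the comparison to use values."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        if not isinstance(other, ValuePoint):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        # Keep hashing consistent with the overridden equality.
        return hash((self.x, self.y))


# Same field values, different objects.
plain_equal = PlainPoint(1, 2) == PlainPoint(1, 2)   # inherited: identity test
value_equal = ValuePoint(1, 2) == ValuePoint(1, 2)   # overridden: value test

print(plain_equal, value_equal)
```

Note that the override comes paired with a matching hash function, much as the .NET languages pair Equals() with GetHashCode().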

Python and VB.NET use the is/Is keywords for identity comparison and ==/= for equivalence, as in

# Python

if x1 is x2:
  print("x1 and x2 refer to the same object")
if x1 == x2:
  print("x1 and x2 are equal")

' VB.NET

If (x1 Is x2) Then
  System.Console.WriteLine("x1 and x2 refer to the same object")
End If
If x1.Equals(x2) Then
  System.Console.WriteLine("x1 and x2 are equal")
End If

D uses === / !== for identity comparison and == / != for equality comparison, as in

// D
if(x1 === x2)
{
  printf("x1 and x2 refer to the same object\n");
}
if(x1 == x2)
{
  printf("x1 and x2 are equal\n");
}

There is a fair amount of variation in syntax with all these languages. The first problem is not just that the syntax differs between the languages, but that the syntaxes overlap and therefore will likely lead to mistakes. For example, C++ and D use == for comparing equality, which seems a valid choice given the fact that they are C-family languages and C uses that operator for equality. For basic types, Java and J# use == for comparing equality, but for object types (or rather, object references), == compares identity. It can be argued that this is consistent in one sense, since it is the values of object references that are being compared for equality, which translates to a comparison of identity for the referenced objects. Nonetheless, it is needlessly confusing and leads to problems when mixing languages.

It can be argued that this is a biased perspective, from one with a C++ background, and that Java programmers quickly get used to the situation. Nonetheless, there is an inconsistency between the syntax and semantics of built-in types, for which == provides value comparison and there is no identity comparison, and those of class types, for which equals() provides value comparison and == provides identity comparison. You also have to wonder how the language designers will deal with implementing generics in future versions of the language if the semantics of such fundamental operators depend on the types to which they’re applied.

Even after half a decade of Java, I still occasionally use == when I should use equals(). Python offers a better approach by providing the is keyword, which performs identity tests, and saving == for equivalence testing. This is such a rich seam of fervent lingua-religious debate, though, that I’m content just to stipulate that my preferences are partial and as (in)valid as the next person’s.

What is important is to note that there are inconsistencies. This leads to the second problem in our language family. Perhaps mindful of the confusing variation in syntax, the architects of C# seem to have tried to obviate the situation, and at the same time win over C++ programmers, by providing for the overloading of the == and != operators for equivalence testing. Alas, they’ve opened a hornet’s nest of potential problems.

C# allows you to use the Object class/static method ReferenceEquals() for conducting an identity test. This method works fine and, aside from the fact that it appears to not be inlined (as we will see in the second installment of this article), it is the appropriate choice. ReferenceEquals() can be considered equivalent to the Python is keyword. For equality tests, C# provides the two-parameter Object class/static method Equals() and the one-parameter overrideable Object instance/non-static method Equals(). The two are used as follows:

// C#
if(Object.ReferenceEquals(x1, x2))
{
  System.Console.WriteLine("x1 and x2 refer to the same object");
}
if(Object.Equals(x1, x2)) // Class/static
{
  System.Console.WriteLine("x1 and x2 are equal");
}
if(x1.Equals(x2))         // Instance/non-static
{
  System.Console.WriteLine("x1 and x2 are equal");
}

As I said, C# provides the facility to overload the == and != operators on a per-class basis. (In fact, it mandates that if you overload == then you must overload !=, and vice versa, which is probably a good idea.) However, this is an accident waiting to happen. If your class does not overload these operators, and does not inherit from one that does, then == will translate to an identity check, i.e., equivalent to a call to Object.ReferenceEquals(). Conversely, if your class, or one of its ancestors, does overload these operators, use of == will result in a call to that overloaded operator which, as we see below, will be implemented as a call to the instance/non-static Equals() method for that class. (C# also expects you to override Object.Equals() and Object.GetHashCode() if you overload == and !=, and the compiler warns if you do not. Again this is probably a good idea, though it does make writing simple test programs for article research somewhat tedious.)

I don’t know if I’m the only one who thinks so, but this situation seems crazy. Testing for identity and testing for equality are completely different things conceptually [1]. To not know which you are getting when you type in == is nonsensical. Having to resort to the documentation (should there be any) or the assembler on a per-class basis to elicit the semantics of basic syntactic elements doesn’t stand up. Even worse, the class(es) you are working with may be redefined after your first version of code is written and working, resulting in both semantic and probably performance changes to your code without you having altered a character.

C# aficionados may counter that C++ (which is, perhaps obviously, my primary language of choice) provides the same dangers, if not more so. But this does not hold water. In C++, the semantics of == are only changed from meaning a test of equality by incompetent or malicious action. In C# they can be shifted from identity to equality, or vice versa, by the informed action of experienced developers.

(Of course, the .NET version-locking mechanisms are designed in part to prevent such potentially destructive changes in semantics from affecting client code, but that can only work when they are used. During development, or when using publicly available binaries/source, it can be impractical/undesirable to operate these mechanisms, hence the risk is real.)

Null

There is another gotcha lurking under the surface of all of these languages, save for C++. Apart from C++ [2] and Python [3], all the languages discussed here can have null references. Calling a method on a null reference leads either to a NullReferenceException/NullPointerException being thrown (C#, Java, J#, VB.NET) or to a generic Exception (with “Access Violation” message) being thrown in D. [4]

In C++, equality comparison is a first-class concept, supported by both the language and the compiler, and adaptable and re-definable under programmer control. Hence, writing x == y has well-defined and supported semantics.

In C#, D, Java, J#, Python, and VB.NET, equality is provided by calling instance/non-static methods (Equals(), equals(), eq(), __eq__()) on reference instances (see Table 1). This reflects the fundamental difference with C++, in that a reference may never be (legally) null in C++ [2], but a reference can be null in the other languages (except for Python [3]). An analogy would be so-called “smart” pointers in C++, where operator ->() provides access to the “actual” object instance, but there is no guarantee that the result of sp.operator ->() is non-null.

Thus, when calling the equality method in these languages you must ensure that the instance on which it is called (and probably also the one serving as argument) is non-null, as in

// C#, J#
if( x1 != null &&
    x2 != null &&
    x1.Equals(x2)) . . .
// Java
if( x1 != null &&
    x2 != null &&
    x1.equals(x2)) . . .
// D
if( x1 !== null &&
    x2 !== null &&
    x1.eq(x2)) . . .
' VB.NET
If Not x1 Is Nothing AndAlso _
   Not x2 Is Nothing AndAlso _
   x1.Equals(x2) Then . . .

Without such checks, the calls will result in the throwing of an exception. But this is ugly stuff, to be sure! Such checks only need to be made in circumstances where the validity of a reference has not already been established, so would not be needed on every test. Nevertheless, it remains a common and necessary thing.
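One way to avoid repeating these checks at every call site is to centralize them in a helper, in the spirit of the two-parameter class/static Object.Equals() described earlier for C#. The following is a minimal Python sketch of that idea, with None standing in for a null reference and an explicit equals() method standing in for the instance/non-static equality method. Both safe_equals and Name are hypothetical names invented for this illustration.

```python
def safe_equals(left, right):
    """Hypothetical helper: compare two references, either of which may be None."""
    if left is right:                  # same object, or both None
        return True
    if left is None or right is None:  # exactly one side is None
        return False
    return left.equals(right)          # reached only when both are non-None


class Name:
    """Hypothetical value class with a Java-style equals() method."""

    def __init__(self, text):
        self.text = text

    def equals(self, other):
        return isinstance(other, Name) and self.text == other.text


print(safe_equals(Name("a"), Name("a")))
print(safe_equals(Name("a"), None))
print(safe_equals(None, None))
```

The identity test comes first for two reasons: it handles the case in which both references are null, and it short-circuits the potentially expensive value comparison when both references denote the same object.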

D also provides the == and != operators for equality tests that translate into calls to eq(), as follows: x1 == x2 and x1 != x2 will be translated by the compiler into x1.eq(x2) and !x1.eq(x2). This is a nice convenience as far as it goes; I’ve never written a != that wasn’t a logical inverse of ==.

D does not “overload” the meaning of this operator by also using it to test for identity, as do C#, Java, and J#: identity checks in D use the === and !== operators, which is a good thing. Unfortunately, because of the automatic translation into calls to the eq() instance/non-static method, if the left-hand operand is a null reference, or the null keyword itself, an exception (Exception(“Access Violation”)) is thrown!

As noted above, C# provides the ability to overload == and !=, and it does this on a class/static method basis. This is better than the D solution, but only as long as you either test for the operands being null, or pass them off to a compare function that makes such checks (such as Object.Equals()). Correctly written C# == and != overloads do, therefore, represent a superior solution for equality testing. (Or they would were it not for the semantic “overload” clash with identity checking, mentioned above.)

The situation with Python is somewhat similar. There are no null references, so you don’t have this precise problem. However, because it is dynamically typed, it is possible to evaluate x == y where the two operands are of completely different types. In such a circumstance, the __eq__ method of the x variable will be used if one has been defined; otherwise the interpreter falls back on its default comparison, which for class instances compares identity. In the former case, the implementation may reference members of the right-hand operand that do not exist, resulting in an AttributeError exception being thrown.
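The AttributeError hazard, and one common way of guarding against it, can be sketched as follows. Celsius, SafeCelsius, and compare are hypothetical names invented for this illustration. The guard shown, a type check that returns NotImplemented, is the usual Python idiom rather than anything prescribed by this article.

```python
class Celsius:
    """Hypothetical class whose __eq__ assumes the other operand's type."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __eq__(self, other):
        # Raises AttributeError if 'other' has no 'degrees' member.
        return self.degrees == other.degrees


class SafeCelsius:
    """Hypothetical class whose __eq__ checks the other operand first."""

    def __init__(self, degrees):
        self.degrees = degrees

    def __eq__(self, other):
        if not isinstance(other, SafeCelsius):
            # Decline the comparison; the interpreter then tries the other
            # operand and finally falls back to an identity test.
            return NotImplemented
        return self.degrees == other.degrees


def compare(left, right):
    """Return the result of left == right, or the name of the error raised."""
    try:
        return left == right
    except AttributeError:
        return "AttributeError"


print(compare(Celsius(20), "20"))
print(compare(SafeCelsius(20), "20"))
print(compare(SafeCelsius(20), SafeCelsius(20)))
```

The guarded version plays the same role as the null checks in the other languages: it establishes that the right-hand operand is something the comparison can safely inspect before touching its members.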

All this is one in the eye for the critics of C++’s support for pointers, since == on a pointer compares identity and on a reference/instance compares equality. Of the pointers-as-references languages, perhaps the most sensible approach is taken in D, which uses == for equivalence and === for identity, although there is a nasty gotcha with that as well.

In my opinion, the best way for these things to be handled in the non-pointer languages would be a combination of the approaches of D, C#, and Python. Identity testing would always be effected by the is keyword, just as Python does. As an aside, I would suggest the test against type would be via the isa keyword (this is what the C# is keyword does). Equality testing would be via the == / != operators, which would be automatically translated by the compiler into calls to an Equals() method, in the way D does. However, the compiler would guarantee that the object on which Equals() is called and the parameter given to it (which correspond to the left- and right-hand sides of the == and != operators) would be non-null. Where necessary, the compiler would emit code to test for null references, but where the references can be determined to be non-null, the compiler would optimize out the null tests. This solution would thus fold semantics into a single meaning, ensure safety, and maximize efficiency.

Notes

[1] Equality testing is kind of a superset of identity testing in that when two object references refer to the same object, it is axiomatic that the objects so referenced are equal.

[2] One can perversely create a null reference by X &xr = *(X*)0 but this is a deliberate violation of the language rules, rather than a normal part of C++ usage.

[3] I am no Python expert, but as far as I know one cannot have a null reference: names are created when assigned to, and the nearest thing to null, None, is itself an object that supports both is and ==.

[4] Walter Bright informs me that he plans to change this to be an instance of SystemException or derived class.

Matthew Wilson is a software development consultant for Synesis Software, specializing in robustness and performance in C, C++, C# and Java. Matthew is the author of the STLSoft libraries, and the forthcoming book Imperfect C++ (to be published by Addison-Wesley, 2004). He can be contacted via matthew@synesis.com.au or at http://stlsoft.org/.