Inconsistencies with the `==` operator in R
I found a bug with the == operator in R!
By Daniel Chen
August 6, 2019
One of the cool things about working on gradethis (grader needed to be renamed) is that we end up doing things that aren’t common in R (i.e., grading and comparing code).
I discovered an inconsistency with the == operator when comparing (long) R expressions.
A quick primer on expressions
In R, you can create an expression using the quote() function.
An expression is essentially the code that R will execute.
It is similar to a “string” of code, but an actual string in R evaluates to a string, not to a command or set of instructions that R will execute.
Compare:
3 + 3
## [1] 6
which returns the result of evaluating 3 + 3, and
"3 + 3"
## [1] "3 + 3"
which returns the string "3 + 3", with:
quote(3 + 3)
## 3 + 3
which returns the expression 3 + 3, that is, the instruction to R without actually evaluating it.
If we want to evaluate the expression, we can call eval().
eval(quote(3 + 3))
## [1] 6
You can read more about expressions in the Expressions Chapter in Advanced R.
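As a small sketch of the difference (the variable names here are only for illustration), a quoted expression is a language object rather than a character string, and it can be taken apart or evaluated later:

```r
# A quoted expression is a language object, not a character string.
expr <- quote(3 + 3)
txt <- "3 + 3"

class(expr)  # class of the quoted code
class(txt)   # class of the plain string

# A call can be inspected like a list: the function first, then its arguments.
as.list(expr)

# Evaluating the call runs the code; the string has to be parsed first.
eval(expr)
eval(parse(text = txt))
```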
The “bug”
The bug was detected in gradethis, where we want to compare student-submitted code with the instructor's solution.
There are multiple steps in the comparison process, but the first step is simply to check whether the two bits of code are the same.
If they are, we can stop there and skip the process of detecting where the actual differences are.
The comparison code was originally written to use == to compare the expressions.
user <- quote(3 + 3)
solution <- quote(3 + 3)
user == solution
## [1] TRUE
Garrett Grolemund put in a bunch of examples that show some weird behaviour.
I initially thought it had to do with namespacing the function name, or with using the : notation to select columns in a data frame via tidyselect.
When the two expressions are the same, we get TRUE, as expected:
# supposed to return TRUE
u <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
s <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
u == s
## [1] TRUE
But when we change the value of na.rm, we still get TRUE, even though the expressions are not the same.
# supposed to return FALSE
u <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
s <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = FALSE))
u == s
## [1] TRUE
But if we get rid of the tidyselect column selector, we seem to get the correct result.
# If we remove the third argument the error goes away
u <- quote(tidyr::gather(key = key, value = value, na.rm = TRUE))
s <- quote(tidyr::gather(key = key, value = value, na.rm = FALSE))
u == s
## [1] FALSE
I brought this up at our daily shiny-core stand-up, and Winston Chang thought it might have something to do with the deparse function, since it doesn’t actually matter what the expressions being compared are.
u <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 1))
s <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 2))
u == s
## [1] TRUE
You can see Winston’s comment and link to R code in question here.
Pretty much, when == is used to compare expressions, the expressions are passed through deparse.
When deparse is run on an expression, it breaks the result up into a character vector whose elements are roughly 60 characters wide (the default width.cutoff is 60L).
That part is fine; the R bug is that the comparison is only performed on the first element of that vector.
That’s why the end of a long expression seems to “not matter”.
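To see the splitting Winston described, we can call deparse() directly on the two long calls from above. This is only a sketch of the mechanism, not the actual C code behind ==, and the names du and ds are just for illustration.

```r
u <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 1))
s <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 2))

# deparse() returns a character vector; a long call is split across elements.
du <- deparse(u)
ds <- deparse(s)
length(du)

# Compare only the first chunk of each deparsed call.
du[1] == ds[1]

# Compare every chunk, either as vectors or pasted back into one string.
identical(du, ds)
paste(du, collapse = "") == paste(ds, collapse = "")
```

Looking only at the first chunk hides the difference in the last argument, while comparing all the chunks picks it up.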
Reporting the bug
I reported the findings to the r-devel mailing list, where, even after botching my first listserv submission, I got a response from Martin Maechler (R-core):
Looking at that and its context, I think we (R core) should reconsider that implementation of ‘==’ which indeed does about the same thing as deparse {which also truncates at some point by default; something very very reasonable for error messages, but undesirable in other cases}.
But I think it’s fair expectation that comparing calls [“language”] with ‘==’ should compare the full call’s syntax even if that may occasionally be very long.
So it is actually a behavior that will get patched one day.
The fix
We ended up making changes to gradethis by using identical() when comparing quoted expressions.
u <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 1))
s <- quote(f(x123456789012345678901234567890123456789012345678901234567890, 2))
identical(u, s)
## [1] FALSE
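Here is a minimal sketch of that kind of check, applied to the tidyr::gather() example from earlier. The helper name same_code() is hypothetical and is not the actual gradethis function; because the calls are quoted and never evaluated, tidyr does not need to be installed.

```r
# Hypothetical helper: TRUE only when the two quoted calls are exactly the same.
same_code <- function(user, solution) {
  identical(user, solution)
}

user <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
solution_same <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = TRUE))
solution_diff <- quote(tidyr::gather(key = key, value = value, new_sp_m014:newrel_f65, na.rm = FALSE))

same_code(user, solution_same)  # the two calls match exactly
same_code(user, solution_diff)  # the calls differ only in na.rm
```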
Using identical() is a much better approach when we are comparing code and results, because == returns a matrix when comparing two data frames, and wrapping the comparison in all() has problems when there are NA (missing) values.
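For example, with two small made-up data frames (the names df_user and df_solution are only for illustration), == gives one logical value per cell, while identical() gives a single logical value for the whole object:

```r
# Two small made-up data frames that differ in one cell.
df_user <- data.frame(x = c(1, 2), y = c(3, 4))
df_solution <- data.frame(x = c(1, 2), y = c(3, 5))

# == compares cell by cell and returns a logical matrix.
cell_by_cell <- df_user == df_solution
is.matrix(cell_by_cell)
dim(cell_by_cell)

# identical() compares the whole objects and returns one logical value.
identical(df_user, df_solution)
identical(df_user, df_user)
```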
We want to see if the two vectors are the same:
u <- c(1, 2, 3)
s <- c(1, 2, NA)
all(u == s)
## [1] NA
We can remove missing values, but now when either the user code or the solution code contains an NA, it gets ignored.
u <- c(1, 2, 3)
s <- c(1, 2, NA)
all(u == s, na.rm = TRUE)
## [1] TRUE
u <- c(1, 2, NA)
s <- c(1, 2, 3)
all(u == s, na.rm = TRUE)
## [1] TRUE
Now, we nudge toward using identical() and raise a warning when we detect ==.
u <- c(1, 2, NA)
s <- c(1, 2, 3)
identical(u, s)
## [1] FALSE
Does Donald Knuth owe me a dollar now?