
Yesterday, I took a look at the varying perspectives taken with regards to the Zune 30 debacle. Today, I’ll take a look at what exactly led the Zune 30s to freeze. If you’d like to see the code for the entire driver, click here.
Below the fold lies a sufficiently sized code sample with everything you’ll need to understand what happened with the Zune 30 bug.
Relevant code is non-gray code:
#define ORIGINYEAR 1980
BOOL ConvertDays(UINT32 days, SYSTEMTIME* lpTime)
{
int dayofweek, month, year;
UINT8 *month_tab;
//Calculate current day of the week
dayofweek = GetDayOfWeek(days);
year = ORIGINYEAR;
while (days > 365)
{
if (IsLeapYear(year))
{
if (days > 366)
{
days -= 366;
year += 1;
}
}
else
{
days -= 365;
year += 1;
}
}
// Determine whether it is a leap year
month_tab = (UINT8 *)((IsLeapYear(year))? monthtable_leap : monthtable);
for (month=0; month<12; month++)
{
if (days <= month_tab[month])
break;
days -= month_tab[month];
}
month += 1;
lpTime->wDay = days;
lpTime->wDayOfWeek = dayofweek;
lpTime->wMonth = month;
lpTime->wYear = year;
return TRUE;
}
So what’s the purpose of the while loop? To put it simply, it’s designed to get the number of years from the number of days since 1980 as well as a remainder of days out of the current year. Well, let’s look at how the code plays out in my confusingly simple While Loop in Plain English™. Keep in mind that you’re never supposed to retry the loop without committing some sort of action which will ultimately exit the loop:
- year is set to ORIGINYEAR. Since ORIGINYEAR is 1980, any addition to year is added on top of 1980. This is fine because the hardware is passing the number of days since January 1, 1980.
- Is the number of days greater than 365? If so, proceed. Otherwise, skip to number 6.
- Is the current year a leap year? If so, proceed. Otherwise, subtract 365 from the number of days, add 1 to the number of years, and skip to number 5.
- Is the number of days greater than 366? If so, subtract 366 from the number of days, add 1 to the number of years, and proceed.
- retry number 2.
- …
- Profit?
The problem lies in the fact that there is one case which can never escape the loop. If it’s a leap year, the number 366 will stay within the loop forever because 366 will never be greater than 366, but 366 will also always be greater than 365. 366 days will pass through, in our Plain English While Loop, without having any action committed upon it.
There’s a very quick solution to this: Since 366 is a valid remainder of days in a leap year (December 31 is the 366th day in a leap year), 366 can technically exit the loop without any problems. All we need to do is add a way to exit the loop when the number of days is 366. Let’s see how things go with this new breakpoint:
- year is set to ORIGINYEAR. Since ORIGINYEAR is 1980, any addition to year is added on top of 1980. This is fine because the hardware is passing the number of days since January 1, 1980.
- Is the number of days greater than 365? If so, proceed. Otherwise, skip to number 6.
- Is the current year a leap year? If so, proceed. Otherwise, subtract 365 from the number of days, add 1 to the number of years, and skip to number 5.
- Is the number of days greater than 366? If so, subtract 366 from the number of days, add 1 to the number of years, and proceed. Otherwise, skip to number 6.
- retry number 2.
- …
- Profit!
Thanks to the break (represented by “Otherwise, skip to number 6”), 366 days can escape the loop properly. This would lead to the end result for December 31, 2008 being years == 2008 and days == 366.
This is what the break would look in the while loop:
while (days > 365)
{
if (IsLeapYear(year))
{
if (days > 366)
{
days -= 366;
year += 1;
}
else
{
break;
}
}
else
{
days -= 365;
year += 1;
}
}
Clarifying one quick thing: A number of people on outlets such as Digg suggested that changing “if (days > 366)” to “if (days >= 366)”:

The problem with doing this is that the end result, upon hitting “if (days >= 366)” with the value 366, one would end up with the year going up by one and the operation “366 –= 366” playing out. End result: December 31 of the leap year is suddenly January 0 of the following year.
Extra credit: This loop can be fixed such that the loop’s condition is based on the year’s status as a leap year. The change needed is simple:
while (days > (IsLeapYear(year) ? 366 : 365))
//The above line basically says “Is it a leap year? Check days against 366.
//Otherwise, check it against 365”
{
if (IsLeapYear(year))
{
days -= 366;
year += 1;
/*You’ll notice we no longer need a break. The loop condition now
*checks the days properly. Because the loop properly checks
*whether days is greater than 365 or 366, the second days > 366
*check is no longer needed. */
}
else
{
days -= 365;
year += 1;
}
}
However, if this change is made, IsLeapYear() is now being called twice. This is inefficient, especially on small devices (like the devices which would be using this driver). In order to reduce the number of times IsLeapYear() is being called, it would be best to store the value of IsLeapYear() in a variable.
int leap = IsLeapYear(year);
//leap is determined before the loop begins.
//In this case, it would be true (1980 is a leap year)
while (days > (leap ? 366 : 365))
{
if (leap)
{
days -= 366;
year += 1;
}
else
{
days -= 365;
year += 1;
}
leap = IsLeapYear(year);
//leap is determined again now that the year has changed.
}
This can actually be reduced even further. If you look at how the if/else statement is operating, you can see that it’s basically saying “If it’s a leap year, subtract 366. Otherwise, subtract 365,” which is similar to the while loop’s condition!
int leap = IsLeapYear(year);
while (days > (leap ? 366 : 365))
{
days -= (leap ? 366 : 365);
//days is equal to days - 366 if it's a leap year,
// days - 365 if it isn't.
year += 1;
leap = IsLeapYear(year);
}
If you wanted to reduce the number of times (leap ? 366 : 365) is checked, you could do this instead:
int daysThisYear = (IsLeapYear(year) ? 366 : 365);
//The number of days in the current year is now calculated instead.
while (days > daysThisYear)
{
days -= daysThisYear;
year += 1;
daysThisYear = (IsLeapYear(year) ? 366 : 365);
}
It’s questionable whether or not this code is any better than the original solution with the break, but this would be the “proper” way to do it (instead of using an arbitrary break in your code). It’s also easier to understand if you know C.
The irony in all of this is that this specific driver is used in multiple Windows CE devices, not just the Zune. This is why quality assurance is so necessary.
There are three morals here:
- Test every condition which could happen in every single loop in your code (though, as Matt Boehm points out in the comments, this might be a tad bit hard).
- Don’t visit Digg for programming advice. What you’ll hear is questionable at best
- Make your code easy to read for other developers.

Follow Bryant on Twitter!
Seems like the sort of thing that test driven development would have caught.
As this bug only occurs on the last day of a leap year, it went by undetected for quite some time. While there are quality assurance engineers whose job it is to look for bugs such as this, given the amount of code involved and possible configurations, it’s hard to test for.
There is a famous problem known as the halting problem which states that you can not create an algorithm which can correctly determine if a program given to it contains an infinite loop (for any such program.) While QA people can certainly write scripts to test code and review it, it is almost inevitable for bugs to exist in a shipped product.
As far as alternate coding styles, different coders will find different methods logical. The cost of a few operations for a processor is nothing compared to the cost of having a bug or having code which others find hard to read down the line.
P.S. If you really are conserved with saving cycles, the top line can be changed to “daysThisYear = 366;” (as 1980 is always the initial year and is a leap year.) That should trim about 2 nanoseconds off of the boot time
.
Eh, it’s usually considered a smart idea to avoid using constants in situations where something could change. If, for example, Freescale’s next hardware clock counts from the year 2010, it would be wonderful to not have to look for code affected by this change when reusing code.
Given that different epoch dates are used by different bits of hardware (as well as different bits of software), this warrants keeping the code future-proofed.
Very true. The alternate strategy would then be to define a constant DAYSINFIRSTYEAR right underneath where ORIGINYEAR is defined with a comment explaining that both must be changed together. However, this is indeed a case where for simplicity’s sake, the best path is probably not to worry about making the very most efficient code possible at the cost of legibility and bugproofing.
[...] the code that caused the Microsoft Zune 30 GB units to malfunction on New Year’s Eve was part of the [...]
Actually the **real** problem is that they don’t separate logic like that into subroutines. This logic is repeated in another routine in the same program; in which it looks like it was actually corrected to include a break statement:
//Calculate current year, month, and day use the value stored //in RTC day counter register OALMSG(OAL_RTC&&OAL_INFO, (TEXT("RTCDayCnt=%d\r\n"),day)); year = ORIGINYEAR; while (day > 365) { if (IsLeapYear(year)) { numOfLeap++; if (day > 366) { OALMSG(OAL_RTC&&OAL_INFO, (TEXT("Leap Year: %u"),year)); day -= 366; year += 1; OALMSG(OAL_RTC&&OAL_INFO, (TEXT(", Days left: %u\r\n"),day)); } else { OALMSG(OAL_ERROR, (TEXT("ERROR calculate day\r\n"))); break; } } else { OALMSG(OAL_RTC&&OAL_INFO, (TEXT("Not Leap Year: %u"),year)); day -= 365; year += 1; OALMSG(OAL_RTC&&OAL_INFO, (TEXT(", Days left: %u\r\n"),day)); } }So it’s likely they hit this bug in testing; and corrected it in the one place and not in the other.
comment edited by Bryant: added pre tags to the code sample.
> Seems like the sort of thing that test driven development would have caught.
Only if one was smart enough to include this edge case test.
@iamacyborg:
Well, they screwed that up too. They’re putting out an error message whenever it’s the 366th day of the year.
Software for time computations should use the *Modified Julian Day* [1] which provides a continuous time measure. You can find implementations of the conversion algorithms in astronomy software (including even the Gregorian calendar reform for those really old music files).
[1] http://en.wikipedia.org/wiki/Julian_day
[2] http://en.wikipedia.org/wiki/Gregorian_calendar
i think your last optimizations are superflous. most compilers (msvs & gcc) already do them automatically (loop conditional unswitching, constant & copy propagation).
[...] blog post explains the error in both code and “Plain English”, so check it out, it is very interesting. He also goes into how he would make the code [...]
int leap = IsLeapYear(year);
while (days > (leap ? 366 : 365))
{ … }
huh?
This “solution” breaks the function. IsLeapYear() has to be called once per pass because it returns a different value depending on what year it is…
Otherwise you’re just going to pass IsLeapYear(1980) each time… Which I’m sure the compiler will happily optimise out entirely for you…
int daysThisYear = (IsLeapYear(year) ? 366 : 365);
//The number of days in the current year is now calculated instead.
while (days > daysThisYear)
{
days -= daysThisYear;
year += 1;
daysThisYear = (IsLeapYear(year) ? 366 : 365);
}
This code fails: isLeapYear should vary with year, which changes as the loop progresses. Impressive, to write an article criticizing MS for their bugs, while introducing some of your own…
Andrew, check the code again. Look at this line:
That line checks the newly incremented year (incremented in the line before it) if it’s a leap year and sets daysThisYear accordingly. Example:
-1980 is checked before the loop starts. It’s a leap year, so daysThisYear = 366.
-Loop starts. 366 is subtracted from days.
-year is incremented by 1
-daysThisYear is reevaluated with the new year. 1981 is the new year and is not a leap year. Thus, daysThisYear is now 365 before the loop begins again.
Hmmm, I haven’t programmed C in quite some time, but couldn’t you just change the while to an if? I do not see the reason for using loop construct in this situation.
While I agree with Matt Boehm to a certain extent (actually, almost all of it), engineers are trained to test boundary conditions. Which is why I say to Michael Campbell, any test engineer worth their weight should have realized there are 4 particular boundary conditions that should be tested, namely the the first and last days of “normal” years and leap years. Hell, with a testing environment, it doesn’t really cost you much to set it running, testing each and every day from 1 Jan., ORIGINYEAR, on until eternity, while you clock out for the day. Not only are you able to then test 30 day months vs 31 day months, but you get 29 Feb and 31 Dec leap year for free. In this case, you come back to your computer and you see it’s stuck in an infinite loop, and since it’s in a testing environment, (you did use a testing environment, right?) you know that it’s the 31 Dec on a leap year, or, if you’re lucky, your testing environment has detected that each 31 Dec is (likely) an infinite loop and continued on testing the next 4 years without problems.
@Brandon, in my post, I noted that the loop is “designed to get the number of years from the number of days since 1980 as well as a remainder of days out of the current year. ”
In other words, there will be a healthy number of days to break down. It’s not just 365, 366, or 367. It’s going to be in the thousands.
> Only if one was smart enough to include this edge case test.
In a TDD environment it’s not that the developer would have to remember to test this case, it’s that he wouldn’t have added the code for this case without first having written a test which failed. Test _first_, implementation _second_.
OK, I see now. Thanks.
@Brandon:
The reason is that the Zune stores the days since epoch (in this case, 1 Jan 1980) as an integer. That means that 1 Feb 1982 would be 31+1+366+365=day 763 for the Zune. Using an if instead of the while, it would think that it’s always 1981 for every day after 31 Dec 1980, which for 366 days in all of eternity, would be fine, and then for another 59 days but shifted 2 days of the week, but for every other day (infinity vs 425), it would be screwed up immensely.
A better solution would be not to loop at all, and to instead use division and modulus to do the computation. The ought to read calendrical calculations by Nachum Dershowitz and Ed Reingold and learn how to do this properly.
If you are really concerned about efficiency, you can actually get rid of the loop altogether and replace it with a couple of expressions involving modules and divisions. As for optimizing away the first call to IsLeapYear, well, replicing this funcion with a one-line macro or inline function should be enough to get the compiler to do it for you.
@Andrew
In the code you quoted isLeapYear does vary with each year. What’s the bug?
In 2008, this iterates through 28 cycles every time it’s called. Why didn’t they compute a table of, say, 50 base years from 2000 to 2050 and use bsearch to find the answer? Also, IsLeapYear should start with if (Year & 1) return 0;
Who wrote this lousy code?
since that code is under shared source, where were the opensourcies on that one!!
@Jonathon: Yeah, I agree with you there: of course, who would expect any of these types of devices to last 50 years?! But 50 years x 365.26 days/yr is going to add up to a lot of memory: 18263 distinct values, which if you use SHORT’s to represent the days, still is over 35K of memory for this. It’d be better to use a function.
Jonathan,
Seems to me you are creating the basis of a Y2051 bug.
The above example is why crusty old embedded systems programmers are wary about while loops.
IHO the code below is more straightforward.
while (1) // forever!
{
if(IsLeapYear(year))
{
if(days <= 366)
break; // exit stage left
days -= 366;
}
else
{
if(days <= 365)
break; // exit stage right
days -= 365;
}
year += 1;
}
or even better.
while(1)
{
if(IsLeapYear(year))
daysinyear = 366;
else
daysinyear = 365;
if(days <= daysinyear)
break;
days -= daysinyear;
year++;
}
Both are less confusing and easier to follow. The second case probably would be less likely to be ‘corrected’ by a multiple exit hate’n weenie.
I’m not an expert or anything but shouldn’t the loop be eliminated altogether? I personally don’t see a reason to have that loop at all when you could instead have a series of mathematical calculations which would figure out the number of years and days. Something to the effect of:
ConvertDays(UINT32 days) {
years = floor(days/365);
curDate = ((((days/365)-years)*365) – floor(years/4));
curYear = 1980+years;
}
This such a 80’s issue. Why is code still being written to do a simple function poorly.
First mistake is choosing 1980 as a start year, instead of 1981. It is simpler to convert from date to days and back when the base is on a start of a leap cycle. Then simple 28.25 days for February, or 365.25 for year would work. Also it is better to only use 1 count of days (31, 28.25, 31.. vs 365.25) since it is easy to make mistakes.
Oh, when I write 28.25, a really mean 2825 in a UINT. This 100 times to big so final answer is always / 100.
To the person that wrote this gem:
“As this bug only occurs on the last day of a leap year, it went by undetected for quite some time. While there are quality assurance engineers whose job it is to look for bugs such as this, given the amount of code involved and possible configurations, it’s hard to test for.”
Please enver apply to General Dynamics. You are not wanted. This bug is pathetic.
The problem is that the code looks very simple so unit testing was never done. The problem is that in unit testing, this kind of bug that is so simple to catch that it should have never been missed. This is not the test group that failed. This is the designers and coders that F’ed up. Somewhere a software lead lost his wings today and deservedly so.
No, no, no! You musn’t give code to Microsoft!
Remember that OpenSource is the scourge- let them pay to figure out their problems. More people today have seen code than at Microsoft all year, I’ll bet- and not one of them was paid. Charge them millions for this fix…”it’s what the customers want!” :>
Let’s not run to the rescue with solutions- not for free, anyway. Free solutions “can’t be any good” and “who would provide programming for free?”.
:>
@Crash: not so simple, Years % 100 are NOT leap (but years % 400 are).
@Sineibe, check that code snippet again. The same line is called once inside the loop as well.
Wouldn’t a static analysis tool on the original code be able to trivially alert the programmer that there was a path through the while loop that did not change the conditions of the while loop?
@crash
multiplication and division are expensive operations. I don’t know if the loop is more or less expensive overall though. Also I don’t think the floor is necessary since integer division would be used anyway.
@Gerardo: I’m not understanding why years%100 isn’t but years%400 is. Are you sure? If they happen every 4 years and every year that’s divisible by 4 is a leap year then if years%4 = 0 it’s a leap year so both years%100 and years%400 should be leap years.
[...] first time for the bug to show itself. There are descriptions of the bug in numerous places: here is one if you haven’t seen it yet. Bottom line: a section of code for converting “days [...]
A bug is a bug, except when it’s made public. Anyone who’s ever worked in a software house knows that bugs will exist in software forever. No software is bug-free. No software is ever “finished”, except when a simple bug like this gets into the hands of the public, then SOMEone must PAY! Poppy-cock. Too bad us humans write software. If we didn’t, then maybe, just maybe… it would be perfect.
Umm… While the Extra Credit optimization is valid, each of the three below it is not. Each of them evaluates “year” outside the loop, then increments it inside. In the first version, while we do call IsLeapYear(year) repeatedly, with the exception of the first pass through the loop, each time we call it, we have a new value of “year” to check. In each of the others, we have a stale value of IsLeapYear(year).
It is faulty optimizations like this that cause this sort of problem. Be not so proud of thine coding expertise, youngling.
Foo. My apologies. You do actually recalculate that at the bottom of the loop, where I missed it.
Charles W, don’t fret. Andrew and Sineibe missed it as well, so you’re hardly alone.
@John: I just realized my second line can be simplified to use a modulus so in that case it should read
ConvertDays(UINT32 days) {
years = floor(days/365);
curDate = ((days%years) – floor(years/4));
curYear = 1980+years;
}
Would this lighten the load versus the while loop? Also, why would integer division be used?
Crash, read the wiki on leap years: http://en.wikipedia.org/wiki/Leap_year
Because the there are not “exactly” 365.25 days in a year (which would benefit the year % 4 concept only), then there are exceptions. Every 4th year, except every 100, except every 400, except…. (TBD). The number of days in a year are (to the best of human knowledge), 365.24219, 365.2424, or 365.2425… and all of these are averages. Also, the number of floating point days per year in increasing each year. So, these are only current day rules. Who knows what the future will bring. If only the universe were a digital one.
For fun, read the wiki on leap seconds. Try getting your mind around that one. The super-big science guys just added one this new year’s eve. Did you feel the hiccup in the space-time continuum?
Charles.. “Be not so proud of thine coding expertise, youngling.” lol (Nobody shakes head at the human race)
@Nobody: Thanks for turning my world inside-out. I had heard about the leap second thing recently but I wasn’t aware of leap years being so ridiculous. I tip my hat and call it a day. Let somebody else figure out pi — i mean leap years.
@Crash
Every 4th year is a leap year, unless it’s every 100th year which is not a leap year, unless it’s every 400th year which is. Thems the rules.
1896 yes
1900 no
1904 yes
….
1996 yes
2000 yes
2004 yes
because even 365.25 days per year is not the true value.
@Crash.. sorry ’bout that. Happy New not-a-leap-Year! I was working on a time calculator/converter a few years ago when leap seconds sank my boat. I’ve come to realize that it’s these digital-meets-analog encounters that keep everything so interesting.
>> int daysThisYear = (IsLeapYear(year) ? 366 : 365);
ugh just add the boolean!
int daysThisYear = 365 + IsLeapYear(year);
forcing an if/endif check is inefficient.
as fas as I know, this is the calcualtion for leap years.
int IsLeapYear(int year)
{
return ((year % 400) == 0) || (((year % 4) == 0) && ((year % 100) != 0));
}
@dlesko
Okay, wow. I didn’t think of that at all. Well done.
@Nobody: As long as you enjoy chasing the leap second then have at it! At least I learned something today. I’m sure it’ll come in handy in the near future. Your shattering of my reality wasn’t a deterrent but a learning experience. Thank you.
@Super Francis: I was simply trying provide an explanation of how such a bug could slip by undetected for years after launch if not caught early on while programming. I’m by no means trying to act as an apologist for Freescale. I was mentioning this in the context of Microsoft coming under fire for a bug in a third party driver. Any programmer worth half his salt knows to test his code as he’s writing it and not leave bugs for QA people to catch (or not, in this case.) I’m agreeing with you that this is the result of careless programming. Sorry to have mis-communicated my meaning to you.
I must say that I disagree that this condition would have been hard to test for. The fact that the developer(s) hard-coded in these values knew that there were edge-conditions that they were looking for and it is good development / QA testing practice to check all known edge-conditions. Since the code included a test for when the number of days is > 366, a reasonably prudent developer would have wondered what the code would do when the number of days == 366.
instead of “int daysThisYear = (IsLeapYear(year) ? 366 : 365);”, which I can only assume requires a compare why don’t you just write:
int daysThisYear = 365 + IsLeapYear(year);
More efficient and easier to read (until somebody decides to do something dumb to the IsLeapYear function).
@Jim
That’s what I’m afraid of. I’m afraid of someone altering IsLeapYear to return non-0/1 integers for whatever reason, which is part of the reason for why I didn’t update the main post.
There’s a lot of talk about what the *real* problem is. Well, the *REAL* problem is that someone wrote a function to calculate the number of days since a fixed date, and wrote this function from scratch instead of using an existing function that has been tested for DECADES.
you could do >= then then you’d have to change the 366 to 367
Or rather their converting a number of days to years, but either way…
While I CAN program, I am NOT a programmer, therefore my comment may be stupid or unreasonable, but my take on it is that all computer languages (except assembler maybe) should have this functionality built-in, so your code would just be ‘If Date(today)=LanguageName_LeapYear Then…’ My reasoning is that this problem comes up all the time in so many situations that it should be in the language. If you think about it, the Y2K bug was just another example of the same issue.
[...] Zunes decided to take New Year’s Eve off (http://www.aeroxp.org/2009/01/lesson-on-infinite-loops/) [...]
who are these QA entities of which you all seem to speak? The small organization (the FAA, a part of the DOT) in which I toil ceaselessly (yet somewhat pointlessly) the letters ‘Q’ and ‘A’ have been banned from usage by order of Homeland Security.
as for leap year calculations, we simply call out to the mainframe for the current year’s Feb. 28th date. Then we add 1. If the month is still Feb, then it’s a leap year. If it’s March, then it isn’t. hey, don’t complain. at least we don’t have to write code for machines with vacuum tubes anymore.
@Bryant
int daysThisYear = 365 + !!IsLeapYear(year);
The ol’ double not should do the trick. Anything nonzero becomes a 1. I think it can be reasonably assumed that the function returns 0 if it is not a leap year, and some nonzero value if it is. Otherwise, it breaks any possible testing conditions you could apply to it.
We dogfood our code on our 70,000 employees, plus have massive beta releases, so normal testing is not really necessary. Just this one time it didn’t work (the beta didn’t last long enough).
int daysThisYear = 365 + IsLeapYear(year);
Is bad programming practice and could easily get you screwed.
IsLeapYear() is defined as returning a bool, which in C is 0 or non-zero.
IsLeapYear() could easily return something other than 1 for true (Often I have seen -1; all bits set in bool results) which would really screw with you.
The other point is that bools are not integers and should not be used in arithmetic. With C you will get away with this
most of the time but then you have to trust a) the developer of the function and b) the compiler/library platform you’re
compiling on (things often vary).
A library developer on some random platform might use the inverted last result of a modulus as a return value for example: this would
reduce his code and be a valid return value for the bool result. Problem then is people using that bool in arithmetic would see 0, -1, -2 & -3
as the return values (depending on the CPU architecture).
@Bruce:
Actually, while you’re right in that IsLeapYear’s returned value is intended for use as a bool, it’s not actually returning a boolean. However, this doesn’t actually affect the fact that if IsLeapYear is changed to return something other than 0 or 1 (which is all it’s currently returning), the year counting code would be adversely affected.
[...] Z2K 的出現成因。原來是 Zune 處理閏年的程式碼使用了一個錯誤的 While loop [...]
@Jim, @Yale, @Bryant:
Notwithstanding Bruce’s point about errors when using bools in arithmetic, premature optimization is the root of all evil. Code for readability first. Optimize when your profiler shows you an underperforming algorithm or excessive/lengthy resource locks.
A modern optimizing compiler will reduce multiple calls to the same function with the same input to a single call by storing the output in a register and reusing it. Due to pipelining, modern processors won’t trip about a local jmp generated by a comparison.
@iamacyborg
That is a good point. It would reduce redundant coding if they put that loop into its own function.
yeah my point, though badly stated, is that relying on the non-defined behavior of a library on your particular platform is bad practice.
IsLeapYear() might actually return 0 and 1 in your libraries, but who’s to say it will return the same in a library in another system;
the only definition we have is that it returns bool.
[...] say, but the popular assumption is tThere’s a bug caused by the 2008 leap year – here’s a wonderfully technical explanation and plain English translation of the bug. For reals?!? I guess that means MS has four years to issue a real [...]
I am a very conservative coder. Books such as “Writing Solid Code” have enhanced my conservatism a lot. What strikes me right away is that the use of the “while” loop requires the code inside the loop to modify the value of an incoming parameter violates my sensabilities. This has always been a big “no-no” for me. Copying the incoming parameter to a local variable would have been better, but this code is what I call, “Tricky Code”. Normally the rules coders live by include severals items. Incoming parameters first, pointers to outgoing variables second. I am totally surprised to see code modifying the “days” variable. I acknowledge that “days” is a value parameter as opposed to a reference parameter (pointer) and the chance of the calling routine checking the value of the incoming parameter “days” in the stack frame upon return is unlikely. Neither myself nor any software engineer I know would have written the code this way. Because optimizing compilers are so good these days, writing the code in a tricky manner is no longer required and it could have been done in a much more understandable way. The lack of in-line documentation explaining the code’s functionality reeks of “the code is the doc” mentality. If this is what Microsoft code looks like these days? That would go a long way to explaining why things are so dicey bug-wise with their products.
I would be interested to know what other software engineers think about code that modifies stack frame parameters. It seems like something that violates our basic coding rules in c/c++. Am I incorrect? Is this considered all right these days. I am shocked, but if I am wrong I want to know…
It’s unfortunate that someone decided to invent their own routines to do this. Frameworks like .NET, Java, ICU and JodaTime don’t use loops to do this type of Gregorian time calculation. O(constant) will be faster than O(n), especially after a few years. Also those frameworks are much more thoroughly tested and used much more widely.
Given that the calendar doesn’t change, why not just use a table with the number of (or a running sum) days for each year. One int per year, a simple loop of compares. It’s bound to take fewer bytes that the given code does. It’s silly for every device to recalculate a constant.
Guys, guys! You are all missing the point. This was never a bug to begin with. I have it on good authority that the developer was told that the Zune 30 would not be used after Dec 31, 2008.
On a more serious note, it is apparent to me that the Zune 80 and 120 had the fix already or the issue would have occurred on them as well. They use the same processor (the Zune 30 processor is actually 125Mhz faster than the 80/120). Since the hardware is essentially the same with only minor tweaks to increase battery performance it is safe to assume that the firmware is based on the same code. If it was based on the same code then the bug was fixed in the 80/120. Soooooo what we have here is poor internal communication at MS and a general lack of concern for the consumers. Which is why I have an iPod.
The irony is this function can really be reduced by using Microsoft’s own APIs.
1) Take your epoch year, 1980, and make it into a SYSTEMTIME structure (Trivially January 1, 1980)
Convert that LONGLONG back to a FILETIME, again, either by a trivial cast/memcpy() or direct assignment of the parts.
2) Take that SYSTEMTIME, make it a FILETIME using SystemTimeToFileTime (I believe)
3) Take your number of days, convert it to a 64-bit number (LONGLONG)
4) Multiply that 64-bit number by 86400 to convert days to seconds (24 hours * 60 minutes/hour * 60 seconds/hour)
6) Multiply that product by 10,000,000 to get the number of 100-nanosecond periods. (a FILETIME increments every 100ns).
6) Convert the epoch FILETIME to a LONGLONG either by cast, or by trivial assignment (highpart to high part, low part to low part)
7) Do 64-bit addition (the ARM compiler supports it) of both LONGLONGs together.
9) Convert that FILETIME back to a SYSTEMTIME, and hey, there’s your return value.
The same solution, using the OS to handle the nasty date bits, and relying only on well-known constants, simple multiplication, division, and assignments.
The leapyear check is a one-liner, fercryinoutloud. Here it is…
LEAP = SIGN(MOD(IYEAR / ( 100 – 99 * SIGN(MOD(IYEAR, 100), 4)))) will return 0 for leap years, and 1 for non-leap years.
Not tested, but here’s the logic, from the innermost numbers out
MOD(IYEAR, 100) will be zero for a century year (integer divisible by 100) or a positive number for a non-century year
SIGN(MOD(IYEAR, 100)) will be zero for a century year (integer divisible by 100) or 1 for a non-century year
100 – 99 * SIGN(MOD(IYEAR, 100)) will be 100 for a century year (integer divisible by 100) or 1 for a non-century year
If the year is a century year, you’re dividing by 100 first, otherwise take the year as is.
MOD(IYEAR / ( 100 – 99 * SIGN(MOD(IYEAR, 100), 4))) takes MOD 4 of the year (or YEAR / 100 for a century year). This implements the rule of divide by 4 if not a century year, or 400 if a century year.
The outermost SIGN forces any positive MOD result to 1, and 0 stays as 0. Easy, isn’t it? And now for some computer humour…
(read . (my . (lisp . (no . (new. (taxes))))))
[...] fundamental problem I was browsing an analysis of the Zune 30’s 31 December 2008 bug, and I found myself wondering where it was going. See, [...]
@Tim:
No! Code for readability last, or not at all. You should endeavor to write your code such that someone trying to modify it needs to be at least as clever as you are in order to understand it. Otherwise, you end up with some dope down the line who makes what he thinks is an innocuous change that really ends up sending the program into an infinite loop in some quite foreseeable set of circumstances. Ideally, if your code is modified by someone who has no idea what he is doing, it should completely fail to compile. Barring that, it should immediately segfault, spit out screenfull upon screenfull of garbage data, or just terminate silently with no error message. Particularly vindictive coders will cause it to overwrite a database or two, as punishment for trying out modifications on a production system.
@Yale:
Number one mistake in any IT or Developer career: trying to secure one’s career by making things complicated for anyone other than you.
Always code for reliability. You can use tricks, but don’t obfuscate crap and make sure you comment well if you’re using a technique which can’t be intuitively deduced in less than 30 seconds.
@Bryant:
Good thing I’m not in IT, eh?
That was intended as tongue in cheek, actually. But I still stand by my double not as a way of filtering the return value. It’s either a good way of making sure that the value is always something you expect, or a good explanation why you should never try adding a true/false return value to an integer in the first place–I’ll let the reader decide which.
I think it’s funny that several people are offering their “corrected” versions of the code, some probably having bugs, without testing them. The moral should be clear by now: it doesn’t work until you’ve run it through a unit test. And beyond this, anyone trying to optimize the code, given this context, needs to go back to square one.
@CJ
The offending code was from a device driver for a RTC. I think the reason why only Zune 30’s were affected by this bug is because the other models did not use that particular piece of hardware. So the small differences between hardware that are essentially the same made all the difference in this instance.
@CJ
”
“If it was based on the same code then the bug was fixed in the 80/120. Soooooo what we have here is poor internal communication at MS and a general lack of concern for the consumers. Which is why I have an iPod.
Except you haven’t proved it as true yet.
Simple 1-line fix:
365)
> while( days>366 || ( days>365 && ! IsLeapYear(year) ) )
this gives some efficiency — The leapyear test is only done on one day of the year (and only on non-leapyears) and, even then, only on the last iteration of the loop. It also avoids the break statements (which many programmers reserve for emergency uses only).
Your last change (the DaysThisYear) will result in bad counts for leap years.
@Stephen Samuel:
From a QA perspective, the outrage is that the original loop clearly lacks a variant. A variant is a variable (or abstract expression) which progresses relentlessly toward a fixed termination condition. The variant must progress on *every* iteration. Days is the only variable in the predicate here, so clearly any iteration of the loop in which days fails to decrease is a huge problem.
The other glaring defect is failure to comment the post-condition which appears to be that 0 < days && days <= days_in_year (year). What happens if days is passed into this function as 0? Does the day of week calculation get screwed up later? Day of year is one origin, but the month_tab is zero origin?
I took one look at this and said to myself “where’s the variant?” without even beginning to think about the math or the logic. Termination and correctness are distinct concerns.
No form of this loop should allow a modification to days in a guarded (aka conditional) expression. Days must decrement on *every* iteration, and rightfully this should be an unconditional statement toward the end of the loop body. Until the form of the predicate is changed, any other objection is thinking too hard.
The funny thing is, if you insist that junior programmers define their pre-conditions, post-conditions, and variants, they become motivated to discover that well-tested code already exists.
Check out the article “The Cruelty of Really Teaching Computer Science” on Wikipedia. It doesn’t need to be that extreme, but if you ignore it completely, this is the mess that results.
I’m shocked that the programmer didn’t regression-test his subroutine. According to other parts of the code, the algorithm only has to work from 1/1/1980 through 12/31/2080. It would have been very easy for the programmer to test the subroutine against all possible allowable inputs as follows, and it would have detected the bug.
SYSTEMTIME st;
main ()
{
int i;
for (i=1; i<50000; i++)
{
ConvertDays (i, &st);
printf (”%6.6d %2.2d/%2.2d/%4.4d\n”, i, st.wMonth, st.wDay, st.wYear);
}
}
I used to write these sorts of test frameworks day in and day out when I was coding IBM 370 assembler. I found a lot of bugs that way.
Douglas Goodall I would be interested to know what other software engineers think about code that modifies stack frame parameters. It seems like something that violates our basic coding rules in c/c++. Am I incorrect? Is this considered all right these days. I am shocked, but if I am wrong I want to know…
I don’t really care about modifying stack frame parameters. Though I tend to avoid doing it as it’s useful to have the original parameter around when debugging. As for it causing unwanted side effects? No, anything passed by value’s scope in C is confined to the function.
Really though the actual problem here is historical, which is traditionally RTC chips keep time in MM:DD:YY HH:MM:SS. A better way of handling this is for the clock chip to keep time in seconds since whenever and let the well tested operating system routines handle conversion to monkey time.
I just realized what was bothering my about the code. I write mostly device drivers, and “while” loops without timeouts are a sure sign of a quick and dirty proof of concept driver. Well implemented drivers get timeouts on all loops so that if conditions don’t change, you don’t get hung up. I realize a date calculation wouldn’t normally require a timeout, but the “while” construct should put you on your guard… And I agree there should have been a modifier towards the bottom of the loop that would eventually break the loop. All of this leads towards what we wish we had which is provably correct code. I wonder if an advanced compiler could have protected against the infinite while condition and thrown an exception of some kind after some extreme number of loops. I generally think strongly typed C at the highest warning level does what you want if you can get it to compile without warnings. Should the compiler have complained about the risky control structure? On the other hand, the driver interfaces could have detected a hung thread and failed the driver initialization. Could this failure have been detected at a higher level and merely failed to set the correct system time/date but otherwise continued operation?
There appears to be a certain amount of muscle flexing regarding how to “best” or most efficiently implement the algorithm. Here’s what I’ve come up with in a few hours. It has advantages (shockingly efficient, no lookup tables, no loops) and disadvantages (unaddressed Y2100 bug).
The key is to make Feb 29th the last day of a repeating 1461 day cycle by resetting the origin from 1/1/1980 to 3/1/1976 and to proceed from there …
#include
main()
{
int i;
for (i=1; i<43892; i++)
{
printf (”%6.6d “, i);
dateconv(i);
}
}
int dateconv(int days)
{
int year, month;
days += 1400; // Reset origin to 3/1/1976
year = days / 1461; // Factor out 4 year cycles
days = days % 1461;
// Compute days into the current year and adjust year
year = year * 4;
if (days < 365) { year += 1976; } else
if (days < 730) { year += 1977; days -= 365; } else
if (days < 1095) { year += 1978; days -= 730; } else
{ year += 1979; days -= 1095; }
// Compute the month and day, adjust year for Jan/Feb
// Also change day origin from 0 to 1 by adding 1 to days.
if (days < 31) {month = 3; days++; } else
if (days < 61) {month = 4; days -= 30; } else
if (days < 92) {month = 5; days -= 60; } else
if (days < 122) {month = 6; days -= 91; } else
if (days < 153) {month = 7; days -= 121; } else
if (days < 184) {month = 8; days -= 152; } else
if (days < 214) {month = 9; days -= 183; } else
if (days < 245) {month = 10; days -= 213; } else
if (days < 275) {month = 11; days -= 244; } else
if (days < 306) {month = 12; days -= 274; } else
if (days < 337) {month = 1; days -= 305; year++;} else
{month = 2; days -= 336; year++;}
printf (”%2.2d/%2.2d/%4.4d\n”, month, days, year);
}
Note that on certain machine architectures, a single divide instruction can produce both the quotient and remainder, which could be exploited by an assembler version of the above algorithm.
Gibbon1:
I see your point. Though changing the input parameter values subjects complicated code to the same kinds of trouble that global variables can lead to. Once the routine becomes more than a page long, you might fail to notice code that changed the input parameter and a bug might have been introduced that way. I am sure that is why I learned not to do that.
I think the reason that RTC chips kept time that way because they were used in watches and clocks and the outputs were sent directly to display logic in BCD. For really simple logic implementations of watches and clocks, the conversion to human time would have required time zone information and national date order preferences DDMMYY vs MMDDYY…
The best code :
while((days – (isLeapYear(year) ? 366 : 365)) > 0)
{
years ++;
}
you can test the code above, it would work well!
This is again a case of reinventing the wheel, poorly. For some reason this seems to happen all the time with time and date calculations! (see old RISKS digest archives, which report problems like this every time there is a leap year). A little search in literature or even over the net should have found some elegant, efficient (no loops!), and correct functions for converting to and from “Julian days”, a day count notation used by astronomers. It has a base date very far in the past, but rebasing to 1980 would have been just an extra addition operation. I have myself used this approach succesfully.
@Oldboy
days needs to be adjusted as time goes on too. It’s not just years which is being changed. (Given the nature of your code, it’s safe to say your code won’t work in this situation.)
You can make the short solution even shorter by taking advantage of the fact that IsLeapYear returns precisely ‘1′ for true or ‘0′ for false.. which happens to be the exact number of extra days in a leap year:
int leap = IsLeapYear(year);
while (days > (leap + 365))
{
days -= (leap + 365);
//days is equal to days – 366 if it’s a leap year,
// days – 365 if it isn’t.
year += 1;
leap = IsLeapYear(year);
}
And any compiler in the world will only do the calculation once and use it for both the comparison and the subtraction, but you could always cast it as
int yearlen = IsLeapYear(year) + 365;
while (days > yearlen)
{
days -= yearlen;
//days is equal to days – 366 if it’s a leap year,
// days – 365 if it isn’t.
year += 1;
leap = IsLeapYear(year) + 365;
}
if you like.
Another issue with this routine is its definition. It is defined to return a BOOL when in fact there is no information to be returned.
There is no failure mode, so returning TRUE or FALSE is not necessary. This routine should be defined as void.
Jan: SharedSource != open source. Shared source has varying restrictions that make it useless for most open source uses. Among other things, you’re not allowed to recompile most shared source code and use the fixed code in production (some versions of shared source don’t even allow you to compile the code at all).
Most Open Source developers throw up just reading the license and never get as far as looking at the code.
If the code has access to the POSIX time functions, this code can be used instead. It might not be the fastest code, but it is quite readable.
#define EPOCH_OFFSET 315532800 /* 1980-01-01 00:00:00Z */
BOOL ConvertDays(UINT32 days, SYSTEMTIME* lpTime)
{
struct tm *tm;
time_t now;
now = EPOCH_OFFSET + days * 86400;
tm = localtime (&now);
if (tm == NULL)
return FALSE;
lpTime->wDay = tm->tm_mday;
lpTime->wDayOfWeek = tm->tm_wday
lpTime->wMonth = tm->tm_mon + 1;
lpTime->wYear = tm->tm_year + 1900;
return TRUE;
}
[...] them down was “released” on the internet, and a guy “Bryant” wrote an extended article about the issues with the code. As a followup, a nice and instructive discussion [...]
And if we don’t have access to the POSIX time functions we can use this:
for (;;) {
int yeardays = IsLeapYear (year) ? 366 : 365;
if (day <= yeardays)
break;
day -= yeardays;
year += 1;
}
What’s wrong with this for the while loop?:
while(days > 365) {
if(IsLeapYear(year)) {
–days;
}
days -= 365;
++year;
}
Whoops. I left out a continue… I meant to say:
while(days > 365) {
if(IsLeapYear(year)) {
–days;
continue;
}
days -= 365;
++year;
}
And unsurprisingly, I just realized that’s not going to work. And I also now realize this solution is the same as the original post’s. But for completeness, and to assuage my OCD:
while(days > 365) {
if(IsLeapYear(year)) {
if(days > 366) {
days -= 1;
}
else {
break;
}
}
days -= 365;
++year;
}
I find this discussion (especially the opinions) to be very interesting. I haven’t coded anything in a LONG time, but I have a question about the use of a while loop, Jack’s proposal to use 28.25 and 365.25, and Bryant’s response due to 100/400 leap years.
Since the Zune 30 was produced in 2007, why not just begin the calculation at 2001 instead of 1980, allowing 99 years until the next leap-year skip, well beyond the life of any consumer product? You could then use simple math and integers once to determine the number of days. I know there’d be a y21k bug, but no Zune is going to be in service then.
As I said. It’s been a long time, so I’m probably missing something. What is it?
[...] all the non-coders out there, here’s what happened. According to Zadegan, the code is designed to calculate out the year by looking at the number of days that has elapsed [...]
@jan:
In response to “since that code is under shared source, where were the opensourcies on that one!!”
It’s a *zune*. Like anyone cares!
But hey, given that we can actually see the code, we can all chuckle to ourselves and see why it went wrong. Tried that with a closed source binary blob. A bit hard to cover up source code you can *see*.
re: Digg and programming advice — this ties directly to my theorum of political programming ability: “The more liberal a group of programmers, the lower the quality of their code.”
The average length of a year is 365.2425 (365 + 1/4 – 1/100 + 1/400)
year = floor(day / 365.2425)
[...] the Aeroexperience quality assurance blog explained this past weekend, the sofrware was, “designed to get the number of years from the [...]
[...] La source de la solution que je décris (et une floppée de commentaires intéressants derrière!) [...]
As I have mentioned in my post earlier (http://www.kerrywong.com/2009/01/01/what-happened-to-the-testing/), I think what is most concerning is that such mistakes are overlooked by peer reviewers…
Did someone say use floor? Floor is notorious for having major rounding issues. It is *not* reliable..
@Daeng Bo
Most lower-level hardware providers use the same epoch for the sake of compatibility. 1980 is the most frequently used year, and it’s more attractive when the epoch stays the same (less time rummaging through past code by the developers in order to find what to change).
I am surprised how many people accept that looping is an acceptable way to solve this simple problem. I would give any programmer an “F” that would attempt to calculate leap years using loops like this. Sadly, most computer science students are poor at math.
Hell, first why did they choose 1980 as the start date — that’s a lot of cycles wasted for a product that didn’t exist till 20 years after.
Second, given a maximum lifespan of the product of, let’s say, 40 years that leaves us dealing with a total of 11 leap years so let’s just play it safe and say 15 leap years. Worse case of storing 15 leap years would be in utf-8 or ascii (I would use unsigned 16-bit int for both size and speed but that’s me) At 4 bytes per year that would be a total of 60 bytes. Thus we could use a 60 byte lookup table which would be faster than computing leap years and save code space at the same time. The code given would be larger than the lookup table and lookup code combined (plus they probably can reuse some look up code already there) and easier to verify — read the table and compare. Validate the look up code by writing a simple piece of code to look up leap years using what ever lib your stuff is in.
P.S. Bryant, just because “most” lower-level hardware providers use the same epoch doesn’t mean that it should be used without current justification. Kind of reminds me when I go rafting. The first person in a line of rafts goes through a rapid and flips. The second follows the first and flips. The third, fourth, fifth, etc all follow along and run into the same troubles.
I prefer engineers/designers/architects that learn from the past — what was done right and what was done wrong — and can justify how what they have chosen applies to the current task at hand.
Interesting discussion. It’s pretty funny how the original “anonymous reader” post on Slashdot is quick to lay this at the feet of the QE team – quoting here “…from a coding/QA standpoint, one has to wonder how this bug was missed if the quality assurance team wasn’t slacking off… ” – yet the ensuing discussion with the code actually provided fails to arrive at a mutually-agreed solution… much less one that is actually produced and provided to the QE team, tested and released, on-time and within budget. Lots of moving parts here, folks. I bet the Zune QE team is having a few extra cups of coffee with their Advils this week.
One liner:
for (int daysThisYear; days < (daysThisYear = (IsLeapYear(year) ? 366 : 365)); days -= daysThisYear, ++year);
I guess I meant >=.
[...] [1/5/2009]: Here’s some pretty detailed confirmation that it was indeed an infinite loop error. I know my crashes [...]
leap = isLeapYear ( year );
if ( leap ) and ( dayOfYear > 366 ) then
nextYear ( year ); { procedure to roll over to next year }
@Jon Sebastiani
Depends on language obv, if C then can work out if need to decrement or increment easily enough. Infact thats how Mozilla’s javascript engine does it.
I for one find it hilarious that people are spending this much effort talking about optimizing a piece of code which is rarely ever run on the device. Premature optimization, and all… It shows how bugs like this exist in the first place.
@XS “I for one find it hilarious that people are spending this much effort talking about optimizing a piece of code which is rarely ever run on the device.”
It runs every time the device boots. What are you talking about?
[...] Michael from work points out that you can see the actual code from the bug! Amazing! I never thought I’d see this. Maybe I just don’t read the right [...]
its amazing to see proper coding practices going to the way side so that profit margin can be improved. Mental note Microsoft….break your own code before breaking your customers pocket book. Its a failure in common coding practices.
This code should never have been written. Don’t re-invent the wheel. There are countless subroutines already written, and 100%l tested, that accomplish this same thing.
The entire discussion above is going down the wrong path. You’re trying to fix the bug. Wrong. You should be asking: why do programmers feel the need to continually re-invent the wheel?
Instead, the programmers should be looking at other solutions. Open source operating systems might be one of these solutions. Open source code which is has been used for over a decade by millions of people would not have such problems, because the bug would have already been hit (if it occurred) and fixed. When I have written code involving dates, especially leap years, I have always copied code from a trusted source (like BSD unix). All I know is that calendar routines are tricky — and that knowledge enough was enough to steer me, as an R&D engineer, into thinking: “I won’t try to write the correct code, I will not re-invent the wheel and possibly make a mistake here. I will find fully tested, already written, code to use here instead.”
This is why old software is often better than new software (Microsoft should have learned that lesson back with the original Windows NT, and Mozilla group definitely learned it on Netscape 6): old software, even if it has bloated into bad software, has been sufficiently bugfixed, especially in the commonly used library functions. Don’t re-invent the wheel! Why create new bugs?
This is clearly specified as part of the c std and there’s no need to re-implement the wheel (incorrectly and with a strange loop rather than a simple linear algorithm available from any good collection of numerical algorithms). strftime could have saved them quite a bit of embarrassment.
Everyone immediately wants to blame QA when a bug makes it into the end product. When companies start funding and staffing QA functions at the same level as development functions, come talk to me. But when the ratio of developers-to-testers is 3:1, 5:1, or even 10:1, you have to start adjusting your expectations and accepting some level of risk. Remember, no one in the QA department put the bug in there in the first place. That was a developer’s doing. Also, a bug of this type should have been caught during unit testing–also a developer’s purview. There is a saying “Everyone wants to go to Heaven, but no one wants to die.” So it is for QA. Everyone wants quality, but when you ask them to pony up the money and resources to do it, all you get is static. Sure, it’s easy to determine how this test condition should have been covered–after the fact! Hindsight is always 20/20. If it was soooo easy and sooo logical to test for this condition, then why didn’t the developer just code it correctly in the first place? Next time you want QA to test every edge condition, try telling them what all the edge conditions are. Ah, not so easy now is it?
@Ernie
Most of Microsoft’s teams have 1:1 Dev:QA ratios, though I don’t know about Freescale.
[...] all the non-coders out there, here’s what happened. According to Zadegan, the code is designed to calculate out the year by looking at the number of days that has elapsed [...]
If debugging is the process of removing the bugs from code, programming must be the process of inserting the bugs.
As a matter of professionalism, I check each few lines of code as I go and I do not assume Q/A will come along after me and do my job for me. After I have done my due diligence and made the code the best I can, it is valuable for a different set of eyes to observe the performance of my code. Programmers tend to test their code with the same assumptions they used to write it. When programmers are under such time constraints that they cannot adequately test the code as they go, the process has broken down. The worst case would occur if we became arrogant and assumed our bugs were someone else’s problem. I for one am thankful when someone else is willing to help me perfect my code, and I thank them profusely if they actually find a bug.
As for why 1980, I believe that the time handling routines in Unix were written around that time, and a time_t (long) was expected to be adequate to hold the number of seconds as far into the future as the phone company’s operating system was expected to last.
It has been at least a decade since programmers had to face the reality that there was no longer time to write each and every routine from scratch. Libraries and toolkits are part of what has helped improve developer productivity and you cannot be competitive if you don’t leverage off them where possible. I still hand code the lowest level when there is a reason, but I agree the C standard library time handling routines would have saved a lot of trouble.
What kills me is that somebody is allowed to write software who still thinks subtracting in a loop is a good idea. That’s why MLK invented division. Take that guy’s programming license away. Damn.
Yeah, you have to do a few checks and fixups to deal with a range of years encompassing leap years, but my god does anybody use their brain anymore? This is 3rd grade math for christ’s sake.
This is a perfect example of why software sucks nowadays. Nobody knows to apply simple math.
[...] Zune 30GB on New Year’s eve became nationwide headline news. Now the leaked source code has been diagnosed. The examination is pretty enlightening, as it explains exactly why the crash [...]
for(;;) {
register int d=365+isleapyear(year);
if(d<=days) break;
days-=d;
++year;
}
Short and sweet and if you have trouble figuring this out you’ll have trouble figuring out more verbose versions of same. FORTH actually makes this loop idiom a language feature.
In C++ adding a bool is fine. In C, not so much. Haven’t a clue about C#. Oh, and the calculation for leap years:
static inline bool isleapyear(unsigned y) { return y%4==0&&(y%100!=0||y%400==0); }
I like to not start out checking for the least-often-occurring case. Should probably rewrite the first check to a bit-op too.
@stu
In a number of cases, iterated subtraction is more efficient than division. One such example would be on a portable device which doesn’t run a desktop/laptop processor.
Jason Says:
January 5th, 2009 at 2:46 pm
@XS “I for one find it hilarious that people are spending this much effort talking about optimizing a piece of code which is rarely ever run on the device.”
It runs every time the device boots. What are you talking about?
–
Jason, what are you talking about? It runs only every time the device boots.
I should note that Jason has a (small) point. The boot process for the device is pretty slow as it is. Faster boot times wouldn’t be a bad thing, even though this would only affect it by a very miniscule amount.
Many years ago I wrote a date algorithm that blew up six months later, in a shipped product, but only in the Eastern Hemisphere, when daylight savings time arrived. First Korea… then the UK a few weeks later.
I’ve never treated date algorithms the same since. Where possible, I test by writing an inverse function, using a different technique where possible and test the two algoritms against each other for every reasonable input value. There’s no payback in trying to figure out where the edge cases are – in a case like this, a brute-force test for all of, say, the next hundred years would take no more that a second or so.
Then, if some hotshot decided that a loop was too slow and used a different algorithm, the test would catch all their mistakes at the NEW edge conditions.
In some organizations, this testing would be done by development, in others, by QA. I’m not going to judge which is better. But whoever was supposed to do it definitely fell down on the job.
I’ll just add to the legions already screaming USE THE DATE OBJECT! They must get paid by the line.
choose your language, it’s just as easy…
>>> import datetime
>>> datetime.datetime.now().year – 1980
29
>>> days_left_delta = datetime.datetime(datetime.datetime.now().year + 1, 1, 1) – datetime.datetime.now()
>>> days_left_delta.days
360
>>>
It’s amazing how many people are turning in “solutions” that are either incorrect, or worse, include the exact same bug. Fitch and Cellar both have errors in their code.
If anything this should be a lesson about how hard it is to erradicate bugs, even when you know they’re there.
Hmmm. . . After reading so much ado over a little while loop, I have to ask myself, “Is it any wonder Windows is so over-bloated?” Also, I realize now why we have patch tuesday. (IANAC++P).
Don’t worry, Scott.
This is C.
Some suggestion claimed it was hard to test for. It isn’t, it is a clear edge/boundary case and those are one of the key things to test if you are trying to limit the amount of tests you write. Any good testing strategy should pick that up and any good review of tests should spot that it was missed. I’d say 30 test cases that took about 20 minutes to do. But such a “simple” piece of code is exactly the sort of thing that people just try to test as part of the whole application because hey it’s less than a dozen lines of actual functional code and only used a little bit! Exactly the sort of dark corner where oops ups usually lurk.
First, I have to agree with the comments that this problem should have been prevented by using pre-existing, validated and tested code.
That said, I am (not) surprised that nobody has mentioned code inspections. I learned early in my career the value of proper code inspections using a formal process. Yes, I thought they were a pain in the tush when I first learned them. But as my experienced grew I saw the value of finding bugs as early as possible in the development cycle. Code inspections taught me to be a better programmer and saved me from debugging several serious problems either in the lab or worse, in a shipping product.
1. Test every condition which could happen in every single loop in your code (though, as Matt Boehm points out in the comments, this might be a tad bit hard).
There is an additional issue which should be mentioned about trying to test every possible condition in a program. Lets say there are just 20 if statements in a simple device driver then this means the QA staff or the developers who need to come up with 2^20 or 1,048,576 test cases to properly test their device driver. Possibly their grandchildren could complete writing all the test cases for them.
[...] ha messo ko il lettore multimediale Zune. Ebbene, dall’analisi del codice che ho trovato su questo blog il problema è da imputarsi ad un loop [...]
Regarding testing every loop/permutation, I have always been curious: how do you do it? Someone has to sit down and document the expected correct result for every loop/permutation. Surely he would have died of old age before he finishes a small percentage of it.
I always thought that void SomeMethod(int x, int y) is already a big disaster. The person who wants SomeMethod has to specify 2E+128 outcomes that he wants. So I have given up a long time ago about wanting to work in General Dynamics.
The people who do not forgive the Zune bug are simply armchair critics. I can bet they will be most unwilling to pay $20,000,000 for an MP3 player.
100% coverage testing would have easily caught this bug. First, it would have identified that the leap-year branch of the code had never been exercised by a unit test. Then, the person writing the unit test would have added a check for that, and the next run would have gone into an infinite loop. The reason silly bugs like this slip through is that the industry has deluded itself into thinking that 80% (or some other number less than 100%) is acceptable.
[...] 當偶發狀況發生時,不是每一個人都有能力去找到真正的原因,畢竟除錯是一門專業的學問,可能需要經過環境比對找出硬體瑕疵或故障、需要了解軟體運作與設定來找出設計上的漏洞、也許是時間造成、也許是相容性的關係,當然在台灣這種濕氣重+地震、颱風不斷的地方,更大大提升了硬體被影響的機會,縱然機房往往都是溫控、濕控的地方~ [...]
Something like this seems much cleaner:
for (year = ORIGINYEAR; days >= daysInYear(year); year++)
days -= daysInYear(year);
where daysInYear(int year) = (IsLeapYear(year) ? 366 : 365)
I haven’t tried it and for all I care to check it might not even work, but at least it is very easy to verify that the routine will eventually complete in all cases.
[...] culpable del problema fue un fallo en la programación del firmware de los Zune que hacía que una rutina que calcula la fecha se quedara en un bucle [...]
[...] La sfârşitul lui 2008 (chiar în 31 decembrie), o grămada de device-uri Zune s-au sinucis în masă din cauza unui bug destul de banal (via). Cum a fost posibil ca o bucată de cod să nu fie acoperită de nici o serie de teste şi să bage o parte a Microsoftului în şedinţă ? Iaca, fu posibil
. Detalii despre bug, aici. [...]
[...] tutti gliZuneda 30 GB sono > apparentemente svampati. Tra l’altro poi si
[...] > days -= 365; > year += 1; > } > } > More information and analysis can be found at http://www.aeroxp.org/2009/01/lesson-on-infinite-loops/ — jacob navia jacob at jacob point remcomp point fr logiciels/informatique [...]
I design mission-critical software for embedded powerpc and ARM processors. I also design the ASIC for custom processors. I have been in this business for 13 years. It’s tedious, boring, and I seriously regret ever getting my graduate degrees in computer science and electrical engineering. I will say this: It’s not the programmer’s fault! The lead engineer should have mitigated this problem at a higher level. If I relied on my code monkeys to output perfect code, there would be the blood of a lot of innocent people on my hands. . . literally.
Testing with “100% code coverage” doesn’t work: Unit Test Tools like VectorCAST can give you code coverage statistics, cyclomatic complexity, and paths exercised in Unit Testing, but that’s NO GUARANTEE your test actually exercised the code under all possible in-the-wild conditions. I can easily get 100% code coverage, yet the code does not work? Why? Because 100% code coverage only says that every line of code must be covered which is different from every code path. Even though the lines of code were “covered”, the result of the test may not have been dependent upon that path. Let’s say you got 100% code coverage via two tests: One test covers paths A and B. Test two covers C. That’s considered “100% code coverage” as it’s defined by unit test tools, yet the path A->B->C was never tested nor were any significant permutations of that. Furthermore, the test result of your Unit Test may have resulted in the same answer regardless of the existence of an error in that “covered” path. These are all situations I have run into.
My point is this: it’s impossible to catch every bug in an embedded program and perfection has never happened and does not exist. It’s really the lead engineer’s job to mitigate the risk of misbehaving code at a higher level. Coders are usually working mindlessly to higher level design. Here’s how I mitigate bad coding:
Rule #1: No loops. All code must be in the form of an FSM (finite state machine) with a WATCHDOG TIMER for each state.
Rule #2: No dynamic memory allocation. Screw what you learned in Computer Science. All structures must be static or global.
Rule #1 eliminates the possibility of the program hanging from infinite loops. (Zune problem)
Rule #2 eliminates the possibility of crashing due to overflow or null pointer.
Embedded C is far different from C programming. This industry isn’t about making code fast and efficient. It’s about making it manageable so that if bugs are discovered they can be found. All of the suggestions to make this code more concise would just make bugs more difficult to find. Verbose and explicit coding may seem juvenile to people still in school, but that’s the way people expect you to code in the real world. No one cares that you can write code in one line what might take an idiot 10 lines of code to accomplish. The byte code will look identical because of compiler optimizations.
Układanki – wybieranki…
Gdyby obiektywnie popatrzeć na oprogramowanie, które tworzyłem – nawet
te większe, latami rozwijane przez wiele osób systemy – to napisany
kod, jakkolwiek by mierzyć jego rozmiar, jest jakimś nieszczęsnym
ułamkiem całej infrastruktury. Malutkim zbiorem…