Rivian Failed OTA - A Story on How to NOT do Major Incident Management
Rivian's latest OTA release, 2023.42.0, failed to load on production fleet vehicles due to a "fat fingering" in the release process, leaving hundreds (?) of vehicles in a semi-broken state. Rivian has been quiet on incident communications and has not given owners a consistent message of support and transparency.
[Update 11/14 ~3pm Eastern] Similar or exact text was seen by owners on their registered SMS/TXT phone number.
[Update 11/14 ~4pm Eastern] Jose, the contributor behind RivianTrackr/@RivianSoftware, posted a timeline on their WordPress site: https://rivian.software/2023-42-timeline/ (check out future incident timelines at this location).
If you are an EV owner... you have likely heard that at 7pm EST on November 13th there was a relatively large scale incident (the immediate known impact isn't public knowledge) in which Rivian's November 2023 update (2023.42.0) left vehicles around the world "bricked" or "semi-bricked" after the OTA failed to update and install on the various subsystems in the vehicle.
For those that haven't been following along, below are some public sources with information on what happened, how Rivian responded and what owners themselves are reporting on the situation (in real time in some cases).
Failure of Incident Management
The first and foremost thing I want to say today is that this incident is a direct example of a failed public handling of a major incident (if this classifies as one, and I believe it does, though I don't know the internal classifications Rivian's incident response team uses to determine this). Below I'm going to highlight a few of the failures in this event so far (things are still unfolding as I type though).
- No immediate messaging to impacted owners via their preferred communication method (e.g. phone, email, text)
- Public messaging went to Reddit (a subset of the owner community) when it should have been posted to a Rivian owned news or communication source
- For those impacted, Support teams were overwhelmed and out of the loop as to what was going on, leading to more confused owner experiences and venting to their communities
- Inconsistent Support messaging on how reliable the vehicle is in its failure state - some owners got messaging indicating "not to drive your vehicle" and others received messages saying their "vehicle is driveable".
- Lack of public communication on user/fleet impact - percentage of vehicles, number of vehicles, number of owners, etc.
As many passionate owners have mentioned in various forms, this style of communication and owner engagement is unfortunately par for the course for Rivian. As a fellow owner and software engineer, I hope and pray that Rivian is listening, has the guts and drive to be better in this department, and really leans in to determine which processes and protocols need to change to give their community the relationship and communication they deserve in times like this.
Software failures will and do happen.
How an organization responds to failure and communicates around the resolution process of that failure can make or break them. Attempting to build and retain trust through openness and transparency can be painful. Showing, accepting and talking openly about our faults is painful, not intuitive and too often frowned upon in society as a symbol of weakness. However, I've learned through marriage that a lack of open and honest communication does the opposite... it can lead to walls being built up, isolation, fear, assumptions, loss of trust, and ultimately a broken, irreparable connection to another individual (or thousands, millions even).
As a trained Major Incident Manager for a Cloud Software company as part of my day job, I can say the way in which this incident has been managed is not up to par. Call it growing pains, call it start up life, call it what you want... but the process here is broken and needs to be fixed sooner rather than later, ideally before the next wide scale incident (was this wide scale? we don't know, and that adds to the noise here). Rivian does know the impact of this and (good or bad as it may be) needs to own that story, because if they won't, owners, media and the public will, and they often assume the worst, which is never a good thing. Rivian definitively knows each and every impacted vehicle by VIN/VehicleId (as ElectraFi proves through their insight using Rivian's own Cloud API). ElectraFi counts 31 of its 263 contributing vehicles (production and pre-production included) on this release. Obviously (or maybe not so) this is going to be an inflated number, as I would assume the users that have and know about ElectraFi are more prone to install available software earlier in the release pipe (maybe not anymore?).
The bad production fleet release occurred on November 13th, so installations tracked by ElectraFi prior to 11-13 are "internal beta" fleet vehicles that have opted into ElectraFi and/or under-NDA content writers that have a "press pre-production" release made available to them.
If the above numbers are to be believed and are extrapolated across the production fleet, then the worst case is that 10-15% of vehicles may have been impacted. I have information that leads me to believe it is drastically less than that, but still wider scale than a normal OTA day one release would have been in Rivian's past 1+ years of consumer vehicle deliveries and 18+ OTA releases.
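To make the extrapolation above concrete, here is a minimal sketch of the arithmetic. The two counts (31 and 263) come from the ElectraFi figures quoted above; the function and variable names are my own, and the Wilson score interval is just one common way to put rough bounds on a share from a small sample. It treats the tracked vehicles as a random sample, which they are not (these are self-selected early adopters), so read the result as a ceiling rather than an estimate.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts quoted from ElectraFi in the text above.
tracked_total = 263           # contributing vehicles (production + pre-production)
tracked_on_bad_release = 31   # of those, vehicles counted against 2023.42.0

share = tracked_on_bad_release / tracked_total
low, high = wilson_interval(tracked_on_bad_release, tracked_total)

print(f"share of tracked vehicles: {share:.1%}")
print(f"rough 95% bounds: {low:.1%} to {high:.1%}")
```

The point share lands inside the 10-15% worst case mentioned above. Because pre-production vehicles are mixed into the 31 and ElectraFi users likely install early, the real production fleet share should sit below this.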
Silver Linings
I don't want to make excuses, but there are some silver linings in this production fleet incident that we have only now learned about. Rivian Engineering may have known this themselves, but there was nothing in the public space that confirmed these findings until this event happened last night and owners around the country were using Twitter, Reddit, Discord, Facebook, Rivian Owner Forums and Signal to communicate with each other to determine which systems worked and which didn't, as well as some creative workarounds.
For context, my wife and I received our OTA notifications, and although I am one that likes to beta test (or nightly test) software for the "greater good" (to my wife's displeasure), the OTA rollout campaign was pulled/halted before I could tap the Install Update button for my 2022 Rivian Blue R1T.
Rollout Campaign was Cancelled/Halted Mid-Flight
Many owners like ourselves received notifications after our vehicles phoned home and downloaded the ~3GB update file last night (around 6-7pm EST on a Monday). To my memory Rivian has never released an OTA on a Monday; it is usually Thursday/Friday. Similarly, the optics from the outside are that this first released cohort was larger than typically done in the past (though that is my perception and not based on any data).
Thankfully many (?) owners received these "ready to install" mobile notifications only to learn that there wasn't any button or ability to install the update, because their vehicles had reverted their "Next OTA Version" back to the current production release, 2023.38.0, which was released in October.
Driver's Screen - Critical Functions Only
The driver's screen, being a different control module than the infotainment screen (never really confirmed, but now it seems to be), is functioning AND even shows your speed and selected gear (e.g. Drive, Reverse, Neutral, Hold). Things that aren't functioning on this screen are the left driver information blocks and the realtime augmented reality / Unreal Engine backed driving visualizations.
Camera Displays - Rear/Reverse and Front/Forward Defaults
Though the main infotainment display is not functioning in this state, the reverse cameras do still automatically appear, as many owners confirmed and posted to their online communities.
HVAC - Remote Controlled Only (not while driving)
The first fear everyone had was "will my truck be hot / cold as HVAC controls are on the main screen". This is a legitimate fear, but as some have found out, you can push remote controls for HVAC from the official Rivian mobile app. However, this only works for "pre-conditioning" and setting the default climate preferences for the interior cabin and seats... it doesn't allow dynamic changes while the vehicle is in motion. To change temps or seat controls, one would need to pull over safely and modify the controls manually in the mobile application for the time being.
Media Coverage
and counting...