Putting the Fun in Debugging 20,000 Distributed Devices
Hardware is Hard, Avoid it at All Costs
My first job out of college offered me many interesting experiences. I had somehow found myself the first and only hardware engineer at a software startup in the Valley. The company's initial distribution strategy was installing software on POSes (point-of-sale systems) in small businesses. They quickly found that patching the product to support any and every flavor of POS operating system out there was a real pain in the ass. Furthermore, some POSes with closed ecosystems (cough Square cough) were gaining traction and building their own versions of our product. Not wanting to compete in such an environment, they had a brilliant idea: ship their own hardware to merchants whose POS they weren't able to integrate with. Ahhh, the best-laid plans of mice and men. They began happily shipping tablets en masse. With no team or infrastructure to support the hardware, things started breaking for customers, while the company had very little insight into how or what was going on.
Enter me: a bright-eyed, bushy-tailed electrical engineer. At the time my main interests were electric motors, power generation, power distribution, and RF. Essentially, I was enamored with Maxwell's equations and their applications. This is what I daydreamed about and believed the engineering profession would offer. Instead, I found a desk with a gigantic heap of Salesforce support cases all labeled with "Tablet broken/power problem" and a command from on high: "Nick, to delight customers, the product should Just Work. Sort through these cases, figure out why they aren't Just Working and make the tablets Just Work."
3rd Party Libraries Warrant Suspicion
After spending a few hours reading through the cases and finding no technical information in any of them, I uncovered a clue: a very noticeable uptick in the number of reports starting on a particular date. I asked the development team for the release notes of the product that runs on the tablets from around the dates when I suspected something odd was happening. What did I find? They had shipped a bouncing text animation driven by a 3rd-party npm package. I proceeded to measure and benchmark the app with and without the animation. The package increased the application's power consumption by 2.5x. Our wall chargers were suddenly unable to supply enough power for the hardware to have a net positive power profile, so the tablets drained faster than they could charge. LESSONS LEARNED: be suspicious of code you didn't write, and always have detailed release notes.
They Still Don't Just Work
"Okay, great job kid", my boss told me after we ripped out the power-sucking package and hardware cases began to decline. "But what are you going to do about the baseline 5% of our devices that have never Just Worked?"
The next day at standup I made a naive suggestion. "After reading all of the support cases and noticing a complete lack of technical information, it is clear that with our current processes we are not going to gather and document the data that I need to diagnose our problems. We need to re-train the Ops team to be more diligent and thorough about collecting and documenting this information." I proudly declared this expecting someone, somewhere to heed the call and scurry away to kick off the big re-train initiative. Instead, my words just echoed through the empty halls. My managers probably just laughed, then gave me some advice: "You want to skin a cat, you're gonna have to do it yourself. Figure out a way to get that information yourself. It will be much more reliable at the end of the day."
To Live is To Dream
So our young hero (me) finally got a chance to do what he had always wanted: to dream and to build. I dreamt up a system for getting telemetry from all of the tablets we had in the field.
Here's what I deemed necessary for the project:
- Client code to access the information from each device while it was in the field and report it back in a heartbeat
- An event database to hold all the incoming heartbeats from devices
- A robust way to pipe all of the heartbeats from the devices to the event database
- An application that I could use to query my data and build dashboards that could paint a picture of what's going wrong with the devices
I knew that there was a plethora of hardware information hidden amongst the /sys/class special files on the tablets. All of our tablets were rooted, so I knew that I had access to all of it. The only problem? Our in-store application that ran on the tablets was built using Angular and Cordova. This meant that to access system-level data from the device, I would have to write a Cordova plugin to pass that information up to the JS layer. I wrote some modular code that could run shell commands with root privileges and parse the responses. I then passed the results up to the application/JS layer and wrote some code to send them to our WebSocket servers at 15-minute intervals.
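To give a feel for what such a heartbeat might contain, here is a minimal Python sketch (the original ran inside a Cordova plugin, so this is an illustration and not the code described above). It assumes the battery exposes the standard Linux power-supply `uevent` file with `POWER_SUPPLY_*` lines; the exact path varies by device, and the heartbeat field names are hypothetical.

```python
import json
import time
from typing import Dict, Optional

# Assumed battery node; the real path differs between devices.
BATTERY_UEVENT = "/sys/class/power_supply/battery/uevent"
PREFIX = "POWER_SUPPLY_"


def parse_uevent(text: str) -> Dict[str, str]:
    """Turn KEY=value lines into a dict, dropping the POWER_SUPPLY_ prefix."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if not sep:
            continue  # skip blank or malformed lines
        if key.startswith(PREFIX):
            key = key[len(PREFIX):]
        fields[key.lower()] = value.strip()
    return fields


def build_heartbeat(serial: str, uevent_text: str, now: Optional[float] = None) -> str:
    """Build a JSON heartbeat (hypothetical schema) from one uevent snapshot."""
    fields = parse_uevent(uevent_text)
    payload = {
        "serial": serial,
        "ts": int(now if now is not None else time.time()),
        "status": fields.get("status"),
        "capacity_pct": int(fields["capacity"]) if "capacity" in fields else None,
        # The kernel reports microvolts and microamps; convert to V and A.
        "voltage_v": int(fields["voltage_now"]) / 1e6 if "voltage_now" in fields else None,
        "current_a": int(fields["current_now"]) / 1e6 if "current_now" in fields else None,
    }
    return json.dumps(payload, sort_keys=True)
```

On a device, the text would come from reading `BATTERY_UEVENT` (with root where required), and the resulting JSON string would be what gets sent on each 15-minute tick.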
To allow for fast and reliable communication between tablets belonging to a single merchant, the product team had built a WebSocket service that all of our tablet clients connected to. Because the connections between our devices and the WebSocket server stayed open for long periods of time, I knew that routing the data through this server application to its final destination, the event database (with which our servers also kept open connections), would be the most robust way to get the data from the devices to me. When operating in the sketchy, unknown networking environments of small businesses, you can never be too careful. I set up the servers with Treasure Data's client library and wrote some code to receive the heartbeats from our devices and send them along.
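The server-side hop can be pictured roughly like the Python sketch below. It is not the code that ran in production: the `sink` callable stands in for whatever the event database client exposes, the required keys are a hypothetical minimum schema, and the small retry buffer is my own assumption about what "robust" might look like.

```python
import json
from collections import deque
from typing import Callable, Deque

REQUIRED_KEYS = {"serial", "ts"}  # hypothetical minimum schema


class HeartbeatRelay:
    """Validate heartbeats from devices and forward them to an event sink."""

    def __init__(self, sink: Callable[[dict], None], max_buffered: int = 1000):
        self._sink = sink  # stand-in for the event database client
        # Bounded buffer: the oldest events are dropped if the sink stays down.
        self._pending: Deque[dict] = deque(maxlen=max_buffered)

    def on_message(self, raw: str) -> bool:
        """Handle one WebSocket text frame; return False if it was rejected."""
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            return False
        if not isinstance(event, dict) or not REQUIRED_KEYS.issubset(event):
            return False
        self._pending.append(event)
        self.flush()
        return True

    def flush(self) -> int:
        """Send buffered events in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._sink(self._pending[0])
            except Exception:
                break  # sink is unavailable; keep the event for the next flush
            self._pending.popleft()
            sent += 1
        return sent
```

Rejecting malformed frames at the relay keeps junk out of the table, and buffering means a brief database outage does not silently lose heartbeats.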
I then set up a table in Treasure Data with a schema that matched the payload of the heartbeats. Soon 20k tablets from across the United States of America would be sending all their unhealthy secrets to this special little DB, with the bland and boring name (in comparison to the power that it would hold!) of
Finding The Root Cause
Now that I had data directly from the source, I finally felt that I was in a position to diagnose some problems. I could cross-reference hardware serial numbers from support cases with data from my database to understand exactly what might be going wrong with the broken tablets. Furthermore, with the aggregate dashboard I had made, I could estimate the real size of our problem... and it was big.
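The cross-referencing step amounts to a join on serial number. A rough Python sketch, assuming support cases and heartbeats are available as lists of dicts; every field name here (`serial`, `ts`, `plugged_in`, `capacity_pct`) is hypothetical, not the real schema:

```python
from collections import defaultdict
from typing import Dict, List


def heartbeats_for_cases(cases: List[dict], heartbeats: List[dict]) -> Dict[str, List[dict]]:
    """Group heartbeats by the serial numbers named in support cases."""
    wanted = {case["serial"] for case in cases}
    by_serial = defaultdict(list)
    for hb in heartbeats:
        if hb["serial"] in wanted:
            by_serial[hb["serial"]].append(hb)
    for series in by_serial.values():
        series.sort(key=lambda hb: hb["ts"])  # chronological order per device
    return dict(by_serial)


def died_while_plugged_in(series: List[dict], low_pct: int = 5) -> bool:
    """True if any heartbeat shows a near-empty battery while on charger power."""
    return any(hb["plugged_in"] and hb["capacity_pct"] <= low_pct for hb in series)
```

Each device's series can then be graphed over time, and a flag like `died_while_plugged_in` separates the plain user-error cases from the more suspicious ones.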
It was immediately obvious that many of the tablets were just not plugged in while being operated, causing the tablets to die and then end up in the dreaded DoD (Depth of Discharge) mode. Let's just jot that down as an unsolvable user error for now (I designed a solution for that as well, but it is outside the scope of this post). Many more tablets had something much more insidious going on: they were dying while plugged in. And you can be sure that by this point I had confirmed that the application we were running was no longer consuming 2C (C being the max amount of current the battery was rated to accept while charging). So what was going on?
I found a few culprit devices, pulled their individual data, and graphed it as a time series to look for patterns. This is what I got:
From the graph above I began to piece together what was going on. Something was limiting the current entering the device's battery. Current limiting on its own is not a cause for alarm: lithium-ion batteries require a power management IC (integrated circuit) to sit between them and the 5 V transformer. It is that IC's job to stop pumping in current once the battery reaches its maximum capacity; if a lithium-ion battery keeps receiving current while it is fully charged, its chemistry degrades, which in turn reduces the number of charge cycles the battery will last. The problem here, though, was that our IC was limiting the current before it was necessary, and once it started limiting the current it wouldn't stop.
I found the datasheet for the offending IC and took a look at its state machine diagram:
Reading this, I realized that the IC would think the battery was finished charging and begin limiting the current any time the voltage it detected at its input was below the voltage of the battery. So, unless there was a bug in this IC and it wasn't working as described in the datasheet, there must be something weird going on with the only input of this system: the voltage arriving at the device. This voltage comes from the wall-wart transformer (measuring many samples with a multimeter gave me values between 5 and 5.2 V), then travels down the cable that we provided to customers. If the voltage drop between the transformer and the device (along the cable) was too high, then the voltage detected at the input might be artificially low, low enough to fall below the battery's own voltage before the battery reached full capacity! Eureka!
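The failure mode can be captured in a toy Python model. This is a simplification of the behavior described in this post, not the datasheet's actual state machine, and the supply voltage, cable resistances, charging current, and battery voltage used with it are illustrative assumptions.

```python
class ChargerModel:
    """Toy model of the charge-termination behavior described in the post."""

    def __init__(self) -> None:
        self.limiting = False  # latched once tripped

    def step(self, v_in: float, v_batt: float) -> bool:
        """Feed one voltage sample; return True if current is being limited."""
        if v_in < v_batt:
            # Input looks lower than the battery, so the IC assumes charging
            # is done. Per the post, it does not recover afterwards.
            self.limiting = True
        return self.limiting


def voltage_at_device(v_supply: float, cable_ohms: float, current_a: float) -> float:
    """Supply voltage minus the IR drop along the cable."""
    return v_supply - current_a * cable_ohms
```

With an assumed 5.1 V supply and 1 A of charging current, a 0.2 Ω cable delivers about 4.9 V, comfortably above a battery sitting at 4.0 V. A worn 1.2 Ω cable delivers about 3.9 V, which trips the model, and it stays tripped even if the input voltage later recovers.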
I immediately grabbed one of the 3 m long cables we shipped to customers with the tablets and cut it open. face palm. Incredibly thin-gauge wires, braided with some material so brittle that the braids snapped with the slightest bend. Knowing that the resistance of a conductor is inversely proportional to its cross-sectional area, I figured that as these cables wore in the field, their cross-sectional area (which was already small due to the thin gauge) was getting smaller as braids split, resulting in a large voltage drop along the cables we provided. This voltage drop, which grows in proportion to the cable's resistance, had a non-linear effect on the power profile of our devices due to the logic of the current-limiting IC.
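To put rough numbers on this, the resistance of a wire is R = ρ·L/A, and the voltage lost along it is V = I·R. The Python sketch below uses the resistivity of copper at room temperature and the 3 m cable length from above; the wire gauges (28 AWG vs. 20 AWG) and the 1 A charging current are assumptions for illustration, not measurements of the actual cables.

```python
COPPER_RESISTIVITY = 1.68e-8  # ohm-metres, copper at about 20 °C

# Nominal cross-sectional areas in square metres (illustrative gauges).
AWG_28_AREA = 0.081e-6
AWG_20_AREA = 0.518e-6


def wire_resistance(length_m: float, area_m2: float) -> float:
    """R = rho * L / A for a uniform copper conductor."""
    return COPPER_RESISTIVITY * length_m / area_m2


def cable_voltage_drop(cable_length_m: float, area_m2: float, current_a: float) -> float:
    """Drop over the full loop: current flows out on one wire and back on another."""
    return current_a * wire_resistance(2 * cable_length_m, area_m2)


thin_drop = cable_voltage_drop(3.0, AWG_28_AREA, 1.0)   # roughly 1.2 V
thick_drop = cable_voltage_drop(3.0, AWG_20_AREA, 1.0)  # roughly 0.2 V
```

Under these assumptions, an intact thin cable already loses over a volt at 1 A, and every broken strand shrinks the area further: halving the area doubles both the resistance and the drop.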
The Dreamer's Solution
I brought my findings to the team and from there we worked to source cables with much lower resistance and a higher tolerance for wear. We began shipping those cables to new customers and to all of the businesses that I had identified as having a problem through my dashboards. And all was well: the tablets began to Just Work and our hero got a big raise and his name could be heard echoing off the walls of the office from cheers of praise and adoration. Just kidding, hardware problems never end, that was probably just our hero dreaming again.