Evan Meagher

More trail safety tips

2023-02-21T00:00:00-08:00

As a follow-up to my last post recommending that trail runners carry first aid kits, here are a few more tips for staying safe when running or hiking on trails.

1. Bring water

You should always carry water if you’re planning to be out for more than an hour or if it’s particularly hot out. Heat stroke is scary and can sneak up on you.

2. Bring your phone

Your phone is a table stakes piece of safety equipment, with the obvious caveat that its usefulness deteriorates if you’re going outside of cell service.

A phone helps you know where you are and allows you to call for help.

With a good mapping app, you can track your progress along your planned route and avoid getting lost. Google/Apple Maps have decent coverage of urban trails, but you may want to graduate to a more trail-specific app like Gaia GPS or AllTrails if you’re going off the beaten path. I use Gaia GPS and love it.

Most importantly, if you get into trouble, you need to be able to call for help. For instance, when I fell and gashed my knee against a retaining wall a few months ago, I was able to text my wife and ask her to pick me up at the nearest trailhead. This saved me from having to hobble all the way back to our house with an open wound¹.

In addition to these basic safety features, a phone can help you

identify plants and animals
write notes on ideas inspired by your time outdoors
play music/podcasts²

Maybe we’ll get to a world where a watch can cover all of these bases (Apple Watch Ultra?). But for now, don’t leave your phone in your car when you head out at a trailhead.

3. Bring a whistle

On the same theme of being able to call for help, carrying a whistle is an easy way to help people find you if you get lost. They’re compact and can be heard from far away. So if you stumble down a hillside or find yourself way off the trail, you can make a bunch of noise with the whistle to alert other trail users in the area that you need help.

Bonus points if you learn the basic distress signals. For the SOS signal in particular, here’s a handy visual mnemonic for remembering whether the short or long triplet comes first:

As for what whistle to get, I have one of these on my keychain. It’s really slim, so it doesn’t bulk up my keys. Alternatively, you could get a more traditional coach’s whistle like this.

4. Bring your ID and insurance card

This one’s a little grim, but if you get seriously injured and can’t identify yourself, having an ID on your person helps emergency services personnel know who you are. Similarly, having an insurance card can help streamline your way through the healthcare system.

You might also throw in a laminated card listing your emergency contacts. I don’t currently do this, but probably should add it to my kit.

I slip these in a zippered pocket of my running shorts whenever I head out, which has the side benefit of letting me leave the rest of my wallet at home when driving to a trailhead.

You can always throw in a credit card if you’re planning to hit a cafe or convenience store on your route (or else rely on Apple/Google Pay, since #2 above convinced you to bring your phone!).

There’s often a fine line between preparedness and paranoia, but basic safety precautions like these are painless and help prevent the worst outcomes if you run into trouble in the wilderness.

Stay safe out there!

In that particular case, I would've survived, but it would have sucked and increased the risk of a much longer recovery time. ↩
Using headphones! Don't be the person blasting music nobody wants to hear in the wilderness.↩

Trail runners should pack first aid kits

2023-01-02T00:00:00-08:00

A couple weeks ago, I was running on a trail near my house when I slipped and fell, gashing my knee against the corner of a crib wall¹.

It was thankfully just a flesh wound, but quite painful in the moment and enough to put fitness activities on hold for a while. The experience spurred me to put together a compact first aid kit to bring with me on trail runs, something I’ve been meaning to do for a while.

This post explains why first aid kits are a worthwhile piece of gear for trail runners and describes what I’ve included in my kit.

Why pack a first aid kit?

2022 was an exciting year for trail running as a sport. Sponsorship and prize money are flowing, people are hungry to spectate big races like UTMB, and more amateurs like me are hitting trails in their communities.

One of the great things about running around in nature is that it lets you disconnect from the stresses of civilization. But this also means you’re going to be far from any kind of help if you get injured. Emergency services will take a while to get to you, doubly so if the trail is technical or not well mapped.

This article gives a particularly grim depiction of the risks you take when exerting yourself alone in remote areas. After reading it last summer, I did a little research on outfitting a minimal first aid kit that wouldn’t be annoying to pack when running. Go figure, I had to wait for my next injury out on the trail to complete the project.

What I pack in mine

There’s a bare minimum level of emergency preparedness that one should take when trail running. Aside from always bringing your phone², it’s a good idea to carry a very basic set of first aid materials.

A good goal is to help ensure that an injured person (you, a friend, or a fallen stranger you come across) can get patched up enough to make it to a trailhead for evac by car. Like any other piece of running gear, you want to keep things compact and light.

I bought the smallest first aid pouch I could find. You could of course use any old pouch or a ziplock bag, but I like this one because it’s red and clearly labeled.

Whatever you choose should fit easily into whatever running vest or belt you use when journeying into remote areas. I use a Naked running band most of the time and the little red first aid kit fits nicely into any of its three pockets.

Inside the kit, I keep a handful of items mostly focused on basic wound treatment:

Antiseptic wipes³
Band-Aids
3”x3” gauze pads
A small roll of gauze for wrapping bandages
Antibiotic ointment
Ibuprofen
A pair of tweezers, in case I have to remove a tick from myself or my dog

A couple of the items photographed here were cannibalized from my home first aid kit. Be sure to restock any larger kit you take stuff from so you’re not out of something you need in an emergency.

Other items you might include

What you pack comes down to the types of emergency scenarios you want to be prepared for. Scrapes and cuts tend to be the most common form of injury from falling on a trail, hence the focus on wound care. But you might choose to include other items depending on your body or geography.

If you’re prone to rolling ankles, athletic tape can be used to wrap up and immobilize a sore ankle to help you walk to the nearest trailhead.
Some Tecnu and a small rag would be a great addition for dealing with poison oak.
Ticks are a constant menace in my area, but you might want to pack items specific to other kinds of wildlife. For instance, for bears, the general guidance is to make a ton of noise. A small airhorn can help with this—the Ginger Runner recommends this one in his recent 2022 gear of the year video.

Stay safe out there!

A crib wall is a type of retaining structure that holds a trailbed in place along a hillside. Here's an example (one that I helped build, in fact). ↩
It shocks me how common it is for folks to run without a phone. I know I wrote above how running is a great way to disconnect, but your phone should absolutely be seen as a table stakes piece of emergency equipment for communicating if you need help. ↩
Mild preference for benzalkonium chloride wipes over alcohol wipes, since they’re supposed to be better for cleaning wounds rather than the skin around wounds. ↩

Momentum is magic

2022-02-07T00:00:00-08:00

A concept I’ve frequently found myself referring to lately is momentum.

In physics, momentum is a quantity that captures the tendency for a moving object to remain in motion. Once something gets moving, it takes energy to slow it down, and the forward motion often becomes the new default state.

This is a great metaphor for project work.

At the start of a new project, it’s easy to procrastinate or become distracted by snacks. A proven way to overcome this inertia is to get a few quick wins that point you in the desired direction. The quick wins snowball into a tangible sense of momentum that drives a team forward towards the broader goal.

What can momentum look like in practice?

A new team forms around some big problem. Say, customers have been complaining about slow page loads or the company has decided to prioritize a hairy migration to a new technology. The eventual goal feels daunting, but the team prioritizes a handful of low-hanging fruit tasks that feel like stepping stones towards the Big Outcome. After only a few days on these tasks, the team has collectively built a mental model of the problem and a clear path forward emerges.
In the realm of company-formation, the lean startup methodology is essentially a framework for generating momentum. By tightening the feedback loop between product development and customer feedback, a team is able to build confidence in a solution to a problem faced by some market.
I know that blogging is supposed to be good for my career. But I’m only able to get over the hump of motivation to write a thoughtful longish-form post once or twice per year. The post you’re reading is an attempt to leverage momentum by writing shorter posts when inspiration strikes. Stay tuned for whether it works!

Takeaway

A project’s success is determined by the team’s ability to generate and sustain momentum. A good way to generate early momentum in a project is to prioritize quick wins and explorations that tighten the feedback loop.

How to avoid overengineering

2020-10-01T00:00:00-07:00

This article considers the conditions that lead teams to produce overengineered software and describes how you can avoid falling prey to such conditions.

What do we mean by “overengineering”?

When a software developer says that a piece of software is overengineered, they are saying that they think it has too many moving parts, too much abstraction, or an excessive emphasis on performance. The number of concepts required to understand the thing feels unreasonable.

It’s a fundamentally subjective call, but you know it when you see it. Like a Ferrari at a go kart race, an overengineered system is out of scale with its operating environment and intended usage.

But how does it happen? Are there certain conditions that lead teams to produce systems that observers would perceive as overengineered?

To determine this, let’s consider two stereotypical software engineering phenomena: cargo culting and a related pattern that I’ve started to call “the Xoogler effect”.

From there, we can characterize certain cognitive biases that drive a team towards overengineering.

Cargo culting

Fig. 1. The Trillion Dollar Homepage.

A cargo cult is a cultural phenomenon in which technologically-advanced artifacts become objects of obsessive ritual to a group of people outside of the artifact-producing society. While the term has fallen out of favor within the field of anthropology (in acknowledgement of its reductive and colonialist overtones), it’s fairly common in software circles.

To a software engineer, “cargo culting” is used pejoratively to refer to the adoption of a technology or practice based solely on its origin or popularity. Loosely, the thinking goes that if a tool, language, or convention was developed at or inspired by ideas from a large, successful company, then that tool must have contributed to the company’s success. Thus, in adopting it, you increase your odds of also succeeding.

Examples:

“Era-defining companies like Intel and Google used OKRs, so we should too.”
“The SRE book talks about how Google relies on service level objectives, so using them will help our services become more reliable.”
“Most successful companies end up needing advanced load balancing and request-proxying systems to scale their microservices architectures. Our 10-person startup should adopt these systems in order to help us scale.”

Like incorporating in Delaware or only hiring graduates of name-brand universities, a particular practice correlating with notable instances of company growth doesn’t imply causation. Just because a popular or successful company uses a technology doesn’t mean that it’s appropriate or worthwhile for your situation.

The Xoogler effect

When people leave engineering jobs at big, successful tech companies, they take their former employer’s engineering culture with them. This tends to be highly valuable, for both the technical acumen of the new company and the incoming engineer’s compensation package.

But this tendency can be taken too far. The incoming employee, believing that they have a direct line to the state of the art, may go on to recreate systems in their former employer’s image to an unreasonable degree.

Not to pick on Googlers, but something about that company’s culture really brings this out. There are many notable examples of Xooglers replicating Google-internal systems and practices in the outside world, either as independent startups or efforts within existing companies. I don’t find this terribly surprising, given the tone that’s set within Google—you spend your time there constantly being told that you’re among the best software engineers to walk the face of the Earth, using technologies unmatched in their power, quality, and scalability. It makes sense that former employees feel compelled to emulate things that worked at Google¹.

But this generalizes to other large and/or successful companies. Many people leave the productivity bubble of a “FAANG and friends” company and end up building tools and systems that they miss. To name but a few, this is how we got Thrift, Envoy, and three different tools based on Google’s build system.

Cognitive biases

I think these two phenomena can be linked to a handful of widely-known cognitive biases. Cargo culting is a clear manifestation of authority bias. And the Xoogler effect seems related to the law of the instrument.

It’s rational to want to stay within your circle of competence, but it can be counterproductive in excess. When you’ve been inculcated in the expert use of hammers, every problem looks like a nail. Your thinking in new environments is anchored by prior experience in old environments, which can impede learning and the development of new skills.

And it turns out that these two patterns can feed on each other. Cargo culting is catalyzed by the Xoogler effect in cases when a former big company employee’s experience is uncritically venerated by new colleagues.

How to resist the urge to overengineer

Resisting the temptation to overengineer requires one to be honest with themselves about the context in which they’re operating.

You should always design systems to address problems that are in front of you instead of falling for solutions to problems faced by big companies in the past. You shouldn’t base your company’s infrastructure on GitHub stars. And as wistful as you may be for your last company’s tools, they may not be as applicable to your current situation as you think.

Assessing cargo without becoming a cultist

When evaluating a popular tool, language, or convention, it’s worthwhile to take the time to understand the underlying forces that motivated its invention². Before making any decisions, try to contextualize a technology within the environment that formed it. From there, you can pattern match that environment to your own in order to determine if the technology is appropriate.

As an example, let’s consider everyone’s favorite overengineering punching bag: Kubernetes.

Kubernetes traces its conceptual lineage to Borg³, the cluster-management system that’s run Google’s production workloads for over a dozen years. The main value that Borg provides to Alphabet is its ability to maximize the capital efficiency of their datacenters. The technology that we now call “container orchestration” allowed them to get more computing oomph out of their fleet by maximizing utilization and making more efficient use of machines they’d already paid for.

Is maximizing capital efficiency of your datacenter or cloud environment a concern for your company? If you believe Kubernetes offers other benefits⁴, are you confident that they outweigh the complexity and maintenance overhead?

I find this thought process helpful when evaluating new tools, patterns, frameworks, or management practices. Conventional wisdom advocates that you fully understand a problem before applying a solution to it. It’s also important to understand the conditions that led to a solution and how those conditions align with those you find yourself in.

Regardless of the decision you end up making, this process is likely to make you a better engineer. The job is fundamentally about building systems to solve problems at reasonable cost. That cost is measured not only as upfront capital expenditure, but in the ongoing drag associated with maintenance and cognitive load. More often than we realize, engineering is about right-sizing a solution in light of these ongoing costs.

Things I Learned From Five Years in Climate Tech

2020-02-24T00:00:00-08:00

This post also appeared on The Next Web in March 2020.

Over the past five years, I’ve worked at two startups in what is now being called the climate tech sector¹. Given the recent surge in interest in this space, I thought it would be worthwhile to record a handful of lessons that I’ve learned from this experience. These lessons span business strategy and the realities of the electric utility industry.

Timeline: What has this guy actually done?

I left Twitter in early 2015 with the goal of finding software opportunities in grid modernization².

Soon thereafter, I met a couple guys who were leaving LBNL to start a company around a novel sensing technique for monitoring home energy use. I joined them as the first employee of Whisker Labs, where we set about making it cheaper and easier to mine residential energy data. After a handful of successful pilots and an acquisition, I parted ways with Whisker Labs in 2018.

I then spent some time developing an idea to address pain points I’d observed while working with electric utilities at Whisker Labs. I ended up joining an early-stage team at X that shared my thinking on this specific problem space. I can’t say much of anything about this project, due to the secretive nature of X.

The following unordered and subjective list reflects my thinking about startups in the energy space, particularly those that aim to sell software to utilities.

1. Consumers don’t care about energy

The ideal energy system is one that fades into the background. The overwhelming majority of people don’t ever want to think about how electricity is delivered to their homes or how much of it they’re using. No one actually wants to look at time series plots of energy usage. They just want the lights to turn on when they flip a switch or yell at Alexa.

Early adopter types might enjoy seeing in real time how much power their solar panels are generating or their car is drawing, but this is not functionality that will drive user engagement or revenue at scale.

2. Exits are different than those for traditional tech startups

To be blunt—you’re probably not going to sell your climate tech startup to Facebook. In the event you do get acquired, a statistically likely buyer is an oil major like Exxon Mobil, an industrial giant like Siemens, or a European utility like E.ON, Enel, or ENGIE. These companies aren’t going to pay $10 MM per employee or cater to your taste in programming languages or productivity software.

This isn’t to say that energy startups can’t have exits that make money for founders and investors. But the scale of these exits and the character of acquiring companies is typically not what you see for software startups in other venture-fueled verticals.

3. You live or die by the trust you build in the industry

Energy is the ultimate example of an industry where “build it and they will come” doesn’t work. For utilities, downtime is often measured in fatalities and switching costs are massive. There is huge institutional inertia impeding the adoption of new technology.

Overcoming this inertia takes a long time and requires that your team include deep subject matter experts who can speak the language of the industry. As in any other enterprise setting, you need to build real relationships with stakeholders up and down your target customer’s org chart.

And realize that the people who write checks are typically not the people who will use your software. As a consequence, the quality or state-of-the-artness of your technology often won’t be a primary contributor to your success.

4. Energy economics are a poor match for venture capital

Fundamentally, energy is a commodity good. This translates to razor-thin margins for companies whose primary product is electricity or natural gas. If your business involves selling to these companies, then you inherit their commodity-driven nature, making it very difficult to achieve returns that are attractive to investors used to SaaS companies with 80% margins.

The way many successful companies avoid this problem is by changing the nature of the product they offer. By framing your business around something other than energy/savings (e.g. comfort, convenience, automation, luxury), you can escape the razor-thin margin game. Your overarching mission can still be to save energy or reduce emissions, but as a by-product of a valuable service that customers will pay real money for. In consumer settings, this approach can also help work around the fact that most consumers don’t care about energy (discussed above).

Examples:

Instead of batteries, sell cars.
Instead of home energy efficiency solutions, sell smart thermostats.
Instead of commercial energy efficiency solutions, sell employee comfort.

5. Beware the utility sales cycle

The glacial pace of the utility sales cycle has a big impact on cash management and fundraising for startups. Companies aiming to sell to utilities need to be extra mindful of runway and plan far in advance the traction metrics they can realistically raise money on.

Utilities are some of the most-regulated and slowest-moving companies in existence. They plan budgets in multi-year increments and introduce new technologies over the course of decades. Staying alive as a utility vendor or service provider requires minimizing your company’s burn rate and scraping together enough early pilot projects to show traction to the next round’s investors.

For instance, say you start a company with money from friends and family to build a product that will help utilities decarbonize the residential sector. In order to raise money in 9–12 months, institutional investors will expect to see one or two signed pilot agreements and demonstrated progress towards the milestones of these pilots.

So you send emails, get on the phone, and rack up airline miles to sell the stuffing out of your idea. You convince a decision-maker at a utility across the country that your product is the perfect fit for a demonstration project they’ve been planning for months. This person introduces you to an executive with “Innovation” in their title and helps you develop a case for why your product is exactly what they’re looking for.

Three months later, you’ve made it through an RFP process and landed a $5 MM pilot. The pilot is split into three phases over the course of the next five years, with most of the money coming in the last phase. $500k lands in your bank account for phase zero, which only extends your runway three months once you factor in the people you’ll need to hire. Seven months have passed since your friends-and-family round and you’re losing sleep over the number of parenthesized numbers in your financial spreadsheets.
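For a rough sense of the arithmetic in this hypothetical, here is a minimal Python sketch. The $500k payment and the three months of added runway come from the scenario above; the function names are made up for illustration, and the implied burn rate is simply back-calculated from those two numbers rather than taken from any real company.

```python
def runway_months(cash: float, monthly_burn: float) -> float:
    """Months of runway that a pile of cash buys at a given monthly burn."""
    if monthly_burn <= 0:
        raise ValueError("monthly_burn must be positive")
    return cash / monthly_burn


def implied_monthly_burn(cash: float, months: float) -> float:
    """Monthly burn implied by cash lasting a given number of months."""
    if months <= 0:
        raise ValueError("months must be positive")
    return cash / months


# Scenario figures: $500k for phase zero buys about three months.
phase_zero_cash = 500_000
burn = implied_monthly_burn(phase_zero_cash, 3)
print(round(burn))  # 166667
```

Put differently, the post-hiring burn in this story works out to roughly $167k per month, which is why a headline $5 MM pilot can still leave you short on cash.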

An aside on funding—grants (for instance, from the Small Business Innovation Research program or the Department of Energy) can be an alternative source of early stage funding that is better suited to many energy projects than traditional VC. They provide patient capital and, in many cases, don’t take an equity stake. Grants can help get an R&D project off the ground without having to jump on the VC treadmill right out of the gate.

But grants do have downsides. You lock yourself into arbitrary demonstration milestones that tend to quickly diverge from where market opportunities lie. Customer development should always come first, and you run the risk of getting sidetracked with the care and feeding of grant milestones.

6. Policy is more important than technology

As far as bending the warming curve towards 1.5° C is concerned, I think policy and regulatory reform is currently a larger source of leverage than technology. In particular, carbon pricing (if done right) is the single most effective tool we have to reduce global emissions.

Don’t get me wrong: technology innovation is obviously a primary means to addressing climate change. But I feel more and more that we’re approaching the point at which technology has gone about as far as it can within the confines of a 20th century policy regime. Buying EVs and solar panels makes us feel good, but only represents marginal progress while oil subsidies artificially prop up the internal combustion engine and campaign finance laws give incumbent fossil fuel companies undue influence to hamstring the deployment of clean technologies.

Even within an antagonistic policy framework, solar, wind, and batteries are proving to be cost-competitive with fossil fuels in many geographies. Imagine how far they’ll go once the rules actually incentivize them at scale.

I don’t believe that we’ll get where we need to go without a global price on carbon. This is why I volunteer with the Citizens’ Climate Lobby to build political will for a national carbon fee and dividend policy in the US. It seems insurmountable given the current political climate, but you’d be surprised how much progress we’ve made.

Closing words of encouragement

As I’m sure this essay has made clear, five years of building technologies for sale to utilities has left me rather burnt out by the slow pace and legacy nature of the energy industry. That being said, I remain excited by the incredible work being done throughout the energy/climate startup ecosystem. I don’t mean to dissuade anyone from taking the plunge, but founding teams need to know what they’re signing up for.

And investors need to know what they’re funding. Energy is not the realm of hyper-growth SaaS apps that can scale to 10 million users in the blink of an eye. A utility might only have 20 people in their organization that will meaningfully use a specialized product. And it could take three years of sales and integration legwork to get a product paid for and incorporated into day-to-day work.

So take these lessons with a grain of salt. If you’re setting out on a startup journey in the energy sector, your mileage will undoubtedly vary. I’m rooting for you.

Thanks to Apoorv Bhargava, Nick Clarke, Alexa Rhoads, and Oren Schetrit for reading and providing feedback on drafts of this essay.

Née energy tech, green tech, clean tech. ↩
By grid modernization, I’m referring to technology often associated with the buzzword “smart grid”. It involves augmenting the electric grid with information technologies in order to make operations safer, cheaper, and more flexible. Modernizing the grid is table stakes if we want to decarbonize the economy, as it's fundamentally not designed for a world of intermittent renewables and massive energy demand spikes from EVs. ↩

My Climate Lobby Hobby

2019-03-29T00:00:00-07:00

In a departure from this blog’s typical tech-related content, I’d like to write about some work I’ve done outside of the confines of my day job(s).

For the past few years, I’ve volunteered as a member of an organization called the Citizens’ Climate Lobby. CCL is a non-partisan, non-profit advocacy group whose mission is to build political will for solutions to climate change. Specifically, the group is laser-focussed on lobbying Congress to enact a policy called Carbon Fee and Dividend. This policy would make it more expensive for companies to extract and import fossil fuels, and thus help steer the economy towards better, cleaner sources of energy.

The hook of the policy (in contrast to others like California’s cap-and-trade system) is that the revenue generated by the carbon tax would be distributed evenly to the populace, as monthly dividend checks to all US citizens¹.

To cut to the chase, I’m excitedly writing this post because my congresswoman, Rep. Barbara Lee, this week co-sponsored the bill we’ve been lobbying for, the Energy Innovation and Carbon Dividend Act. This is the culmination of many meetings and calls with congressional staff, letters to the editor, town hall appearances, fliers handed out at farmers’ markets, and alliance-building with other environmental groups. This achievement for our local CCL chapter here in Alameda County is one among many that are happening all over the country (including red districts) as we build a base of bipartisan support for this carbon pricing policy.

Fig. 1. Photo of our team at a recent lobbying meeting at Rep. Lee’s office in Oakland.

That’s my story. I’m really proud of this accomplishment for our chapter and I wanted to share it. Getting involved with CCL has been a desperately-needed breath of fresh air amid the toxic political climate and news cycle we’re all living in. And it’s been a great way to meet inspiring people and learn the ground game of activism.

There are glimmers of hope for representative democracy in America, and it looks a lot like the Citizens’ Climate Lobby.

Thanks to Dr. Ted Obbard, chapter co-lead of CCL Alameda, for reading and providing feedback on drafts of this essay.

There are a bunch of features that make the policy great, regardless of which side of the aisle you’re on. It’s effective because it would drastically and rapidly reduce carbon emissions. It’s good for the economy, given that it doesn’t increase the size of the government. It’s forecasted to create millions of jobs and it gives companies predictable targets for fuel prices. And by putting money directly into people’s pockets every month, it’s an equitable solution that disproportionally helps out households in the lower end of the income spectrum. ↩

Defining resiliency in energy and software

2017-10-19T00:00:00-07:00

The word “resiliency” is all the rage right now.

The notion of resiliency is uniquely applicable in a systems context. Specifically, it is a desirable feature of any system that is made up of many moving parts that operate in a distributed, coordinated fashion.

In such systems, failure is an inevitability that must be planned for. Whether you’re talking about an electricity grid or a network of software services, the study and construction of distributed systems necessarily entail having to worry about component failure. Planning for failure and designing systems to be able to mitigate its impact is at the core of resiliency.

To study the implications of this mindset, let’s take a look at how resiliency is defined in the electricity generation, transmission, and distribution industry.

Setting the stage

Within the US energy sector, a debate is raging among industry analysts, regulators, and vendors on the degree to which renewable energy resources could wreak havoc on our electricity grid. On the one hand, proponents of the status quo of subsidized fossil fuels and centralized power generation are sowing fear that intermittent solar and wind generation will cause brownouts and systemic failures. Countering this narrative is a growing pile of research and field evidence that indicate that the distributed nature of renewables—and particularly the one-two punch of solar-plus-storage—will make the grid more resilient to systemic failures.

These terms (reliability, resiliency) are also all the rage in the software industry. In fact, if you blur your eyes a little and think abstractly about the systems involved, recent trends in software architecture look strikingly similar to those in the renewable and distributed energy resource space.

Fig. 1. Centralized vs distributed architectures of power and data generation, transmission, storage, and consumption.

The last couple of decades saw similar arcs in the trajectories of SaaS-/IoT-era software stacks and renewable energy resources. In place of relying on big, centralized resources, we’re seeing more use of distributed resources. Analogous to the onslaught of microservices and smart devices, the future of the energy grid lies in energy harvested from solar panels or demand response providers, and stored in batteries, cars, or even water heaters.

Defining reliability and resiliency

In my experience in the software industry, “resiliency” is one of those whizbang words that’s fun to throw around, but remains generally ill-defined. Often, reliability and resiliency are used by executives to describe effort spent paying down technical debt.

What is said	What is meant
"We're going to focus on reliability this quarter."	"I'm getting flak from customers/investors about our app not working, so I want you to fix bugs, reduce latencies, and increase success rates, potentially at the cost of timely feature development."

By comparison, these terms are very precisely defined and measured in the electric power industry.

For the electric sector, reliability can be defined as the ability of the power system to deliver electricity in the quantity and with the quality demanded by users. (…) Reliability means that lights are always on in a consistent manner.
Aaron Clark-Ginsberg, What’s the Difference between Reliability and Resilience

In this light, reliability is binary along the time dimension—your thing either works under a given set of conditions or it doesn’t. These conditions are typically defined in a service-level agreement (SLA), which a service is charged with adhering to over time.

Fig. 2. Screenshot from Stripe’s system status dashboard, which is used to signal whether or not their systems are functioning properly.

Resilience is more complicated.

Resilience, stemming from the root resilio, meaning to leap or spring back, is concerned with the ability of a system to recover and, in some cases, transform from adversity.

Clark-Ginsberg’s report goes on to say that “resilience operates from a systems perspective, understanding incidents as a complex process occurring at the intersection of natural and human forces across multiple scales, evolving and changing over time.”

Reliability and resiliency, while related, are fundamentally different attributes. Resilience involves the gray area of partial failure, as in the case of a rolling brownout or a broken widget on an otherwise functional web page. It implies thinking of a service as a system of constituent components, with too many moving parts to be reasonably characterized using a simple “does it work or not” rubric.

We aren’t so different, you and I

With respect to the nature of renewable energy and software systems, it’s not a coincidence that both can be characterized as distributed systems or that both lend themselves towards discussions of resiliency.

In both cases, intermittent and composable resources require thinking about a service as a distributed system. Part of distributed systems theory and practice is the notion that failure is inevitable, and thus the topic of being resilient to failure is paramount.

Thanks to Oren Schetrit and Berk Demir for reading and providing feedback on drafts of this essay.

Lessons Learned Putting a Thing on the Internet

2017-05-02T00:00:00-07:00

This is a text version of a talk I gave on May 1st, 2017 at an event that my company hosted focusing on our experience bringing a connected device to market.

My talk focused on the software side of the company, presenting three lessons that we’ve taken to heart after two years in the trenches building a data-intensive hardware product. Slides available here.

At Whisker Labs, our goal is to unlock value from the electrical networks within homes.

Fig. 1. Whisker Labs prototype device installed on a circuit breaker panel.

I bet you don’t often think about your home’s electrical system. We just turn stuff on and off as we go about our day. And yet, this system is implicit in practically everything that goes on in a home. It’s always there, and almost everything you do in your home has an effect on it. By putting a finger on this pulse, we’re able to provide a lot of value to homeowners and our partner organizations that they wouldn’t otherwise have access to.

The technology involved has two parts. We make a hardware product which monitors the home’s electrical network. And we run a SaaS platform which ingests, processes, and delivers insights from the data.

Lesson 1: Mind your protocols and queues

Software bridges the gap between these two parts of our business. This software spans multiple computational footprints, from embedded devices up to the machines in our cloud.

A good way to contextualize these tiers of software is to consider the time scales at which they operate.

Fig. 2. Breakdown of the time scales at each stage of the data ingestion pipeline.

Time scales

Closest to the metal (literally), we have our sensors. These are custom PCBs that run interrupt-driven tasks on a low-power MCU. The sole purpose of the code running on these devices is to pull data out of magnetic field sensing elements at a rate of about 2,000 Hz, which translates to about 500 microseconds per polling operation.

These raw sensor data are put into a ring buffer with capacity equivalent to ⅒ of a second’s worth of data. Ten times per second, the sensors populate their output buffers with new data. It is then necessary for the sensors to be polled at a rate of at least 10 Hz, or else data will be lost.

In order to avoid hammering our backend, the agent which polls the sensors buffers data in order to space out transmissions to our API. By default, the transmission rate is 1 Hz, but can be modulated either to decrease end-to-end latency for important data or to implement exponential backoff in response to a backpressure signal from the API.
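As a rough illustration of the backoff half of that behavior, here is a minimal sketch of a pacing policy. The class name, the doubling factor, and the cap are all assumptions made for the example, not a description of the agent's actual code.

```java
import java.time.Duration;

/** Hypothetical pacing policy for an agent that uploads buffered sensor data. */
final class TransmissionPacer {
    private final Duration baseInterval; // e.g. 1 second for the default 1 Hz rate
    private final Duration maxInterval;  // assumed upper bound on the backoff
    private Duration current;

    TransmissionPacer(Duration baseInterval, Duration maxInterval) {
        if (baseInterval.isZero() || baseInterval.isNegative()
                || maxInterval.compareTo(baseInterval) < 0) {
            throw new IllegalArgumentException("need 0 < baseInterval <= maxInterval");
        }
        this.baseInterval = baseInterval;
        this.maxInterval = maxInterval;
        this.current = baseInterval;
    }

    /** The API signalled backpressure: double the wait, capped at maxInterval. */
    void onBackpressure() {
        Duration doubled = current.multipliedBy(2);
        current = doubled.compareTo(maxInterval) > 0 ? maxInterval : doubled;
    }

    /** A transmission succeeded: return to the default rate. */
    void onSuccess() {
        current = baseInterval;
    }

    /** How long to wait before the next transmission. */
    Duration nextInterval() {
        return current;
    }
}
```

A caller would sleep for `nextInterval()` between uploads and invoke `onBackpressure()` or `onSuccess()` depending on the API's response. Shortening the interval for important data is left out of the sketch.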

At this point, data has made its way into our backend. But the data itself consists of unitless signal measurements. Turning these raw data into meaningful measurements of current, voltage, and power requires a bunch of math and a learning process which, over time, calibrates its output according to characteristics of the magnetic field environment of the customer’s circuit breaker panel.

At the end of all this, we have output values in terms of scientific units that can be either displayed to customers or fed into further analysis workloads. The time scale at this “application layer” for our data tends to be on the order of seconds, or in the case of historical analyses, days, months, or years.

One consequence of this gradation of time scales is our reliance on queues, or more specifically, ring buffers. I often joke that our system is ring buffers all the way down, because it basically is when you think about it. The way each stage makes the jump to a higher time scale is to queue data until a threshold duration’s worth has been buffered, and then flush the buffered data to the next stage. Unbounded queueing being a recipe for memory leaks, we use ring buffers to fix our buffering capacity at each stage.

What I’m describing isn’t novel; it’s basically network programming 101. I’ll refer back to this in a later section.
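To make the buffering scheme concrete, here is a minimal sketch of a fixed-capacity ring buffer. The class name, the `int` sample type, and the choice to overwrite the oldest sample when full are assumptions for illustration. The sizing arithmetic follows the numbers above: 2,000 samples per second held for a tenth of a second is 200 slots.

```java
/** Fixed-capacity ring buffer of raw samples; the oldest sample is overwritten when full. */
final class SampleRingBuffer {
    private final int[] slots;
    private int head = 0;     // index of the oldest buffered sample
    private int size = 0;     // number of samples currently buffered
    private long dropped = 0; // samples overwritten before being drained

    SampleRingBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.slots = new int[capacity];
    }

    /** Slots needed to hold windowMillis of data, e.g. 2,000 Hz * 100 ms = 200. */
    static int capacityFor(int sampleRateHz, int windowMillis) {
        return sampleRateHz * windowMillis / 1000;
    }

    /** Append a sample; if the buffer is full, the oldest sample is lost. */
    void push(int sample) {
        int tail = (head + size) % slots.length;
        slots[tail] = sample;
        if (size == slots.length) {
            head = (head + 1) % slots.length; // tail landed on the oldest slot
            dropped++;
        } else {
            size++;
        }
    }

    /** True once a full window has been buffered and should be flushed downstream. */
    boolean isFull() {
        return size == slots.length;
    }

    /** Copy out everything buffered, oldest first, and empty the buffer. */
    int[] drain() {
        int[] out = new int[size];
        for (int i = 0; i < size; i++) {
            out[i] = slots[(head + i) % slots.length];
        }
        head = 0;
        size = 0;
        return out;
    }

    long droppedCount() {
        return dropped;
    }
}
```

Each stage would `push` incoming data, call `drain` once `isFull` reports a full window, and hand the result to the next stage. A nonzero `droppedCount` means the consumer is not polling fast enough.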

Protocols

Another consequence that we’ve taken to heart is the fundamental importance of protocols, or the formats in which data are encoded and the policies enforced around interacting with it.

In the web universe, JSON is still the lingua franca. Within datacenters, it’s increasingly common to pass data between services in more efficient formats, like Protocol Buffers or Thrift. In the scientific community, HDF is the standard, being better suited for numerical and tabular datasets.

One thing that I’m glad we did up front was to put thought into what data formats we want to support for exchanging data between the components outlined above. Each transition represents a different set of constraints in terms of memory, computation, and time. Additionally, we foresaw the need for our stream-processing pipeline to support types of data other than those produced by our sensors. For instance, we feed system metrics from our sensors and hub through the same APIs as energy data, letting the backend multiplex different data streams to the appropriate downstream services.

Fig. 3. Example of energy time series data at a ~1-second interval.

What we came up with is an extensible data model based on typed matrices. For any given type of data stream (“energy data”, “system metrics”, “thermostat data”), we can define a series of channels which each have a datatype and a unit. For example, we represent “energy data” as a matrix whose columns convey values like voltage in volts, current in amperes, and power in watts. “System metrics” streams convey the usual suspects (CPU/memory/network utilization) with added emphasis on measurements relevant to embedded systems, like temperature and line frequency.
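As a loose sketch of what such a typed matrix could look like in code, consider the following. Every name here is hypothetical, and storing all values as `double` is a simplification made for the example. It is not our actual data model.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Value types a channel may carry (hypothetical set). */
enum ChannelType { FLOAT64, INT64 }

/** One column of a stream: a name, a datatype, and a unit. */
final class Channel {
    final String name;
    final ChannelType type;
    final String unit;

    Channel(String name, ChannelType type, String unit) {
        this.name = name;
        this.type = type;
        this.unit = unit;
    }
}

/** A typed matrix: one row per sample, one column per channel. */
final class StreamMatrix {
    final String streamType;
    final List<Channel> channels;
    final double[][] rows; // simplification: every value is widened to double

    StreamMatrix(String streamType, List<Channel> channels, double[][] rows) {
        for (double[] row : rows) {
            if (row.length != channels.size()) {
                throw new IllegalArgumentException("row width must match channel count");
            }
        }
        this.streamType = streamType;
        this.channels = Collections.unmodifiableList(new ArrayList<>(channels));
        this.rows = rows;
    }

    /** Return one channel's values across all rows. */
    double[] column(String channelName) {
        for (int c = 0; c < channels.size(); c++) {
            if (channels.get(c).name.equals(channelName)) {
                double[] out = new double[rows.length];
                for (int r = 0; r < rows.length; r++) {
                    out[r] = rows[r][c];
                }
                return out;
            }
        }
        throw new IllegalArgumentException("unknown channel: " + channelName);
    }
}
```

An "energy data" stream would then be a `StreamMatrix` with voltage, current, and power channels in volts, amperes, and watts. A "system metrics" stream would reuse the same structure with different channels.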

In the memory and bandwidth-constrained environments of our devices, we use a custom binary protocol which enacts the matrix representation of our data model. This protocol is itself extensible too, allowing us to annotate payloads with metadata or implement new compression schemes in a backwards compatible way.

As an aside, this is the context in which we developed the adaptive averaging compression algorithm which I wrote about last year.

Being mindful of the protocols we use continues to pay dividends. By tailoring our data formats towards extensibility and our specific constraints, we’ve been able to drastically reduce data sizes, reduce the rate at which we produce technical debt, and streamline the process of integrating with external data providers.

Lesson 2: Err on the side of conservatism

A second lesson that we’ve learned is that production-grade embedded development requires a level of conservatism not fashionable in popular software culture.

This lesson is derived from the basic premise that we have no control over devices once they leave the factory. If the device installation process fails, there’s no crash-reporting system you can rely on for debugging. Once online, a device can disappear at any time should the homeowner’s power go out or they change their wifi password. Or the device’s network connection may slow to a trickle a few times per day because their breaker panel is on the opposite side of a wall from a microwave.

This throws a wrench in our ability as fleet operators to ensure timely data collection or reliable upgrades to the devices’ software. When you put a thing in someone’s home, you lose the ability to make any assumptions about its availability and serviceability.

The low-hanging fruit mitigation strategy is to enforce rigorous testing throughout the development process and thorough QA before devices leave the factory. This decreases the likelihood of self-inflicted wounds from bugs in our own software. But many risk factors are outside of our control. What if a critical open source library stops being maintained? What if a vendor on which you rely goes out of business?

Being a small team with tight shipping schedules, we may not be able to afford the sudden disappearance of a giant on whose shoulders we’ve stood. In contrast to a similar situation in the web development sphere, like migrating a user auth flow from Parse to Amazon Cognito, you can’t exactly swap out your device’s runtime with 100% confidence.

To de-risk our embedded software, we’ve erred on the side of conservatism when choosing to tie our ship to any outside entity. For instance, we’ve resisted the nascent trend towards running bleeding-edge containerization tools on our devices. On the one hand, being able to hermetically isolate processes would be a boon for security. But we’re not comfortable betting our devices’ decade-long operating requirements on open source projects that garner 100 commits per week.

Lesson 3: Everything old is new again

I am of the opinion that “IoT” is more a marketing term than an indication of a fundamental shift in technology.

Looking out on the consumer landscape, we’re inundated in a sea of smart devices. Every appliance manufacturer under the sun is trying to sell us more expensive versions of their products that have screens or can send us text messages.

On the research end, if you read the literature on the Internet of Things, you’re vaulted into a near future in which our world is transformed by ubiquitous, networked sensors. No human desire or action is left unchanged. We are about to live in a Charles Stross novel.

After working with this stuff for a few years, I’ve started to wonder how the IoT trend will leave its mark on the software industry itself. I’m hopeful that as more developers work on or around connected devices, we’re collectively getting a chance to relearn how computers actually work.

In years past, the advance of Moore’s Law led to incrementally higher-level innovation in how software was developed and run. First we got structured programming and higher-level languages. Then operating systems and virtual memory. Then virtual machines.

As the Internet grew in the 90s we saw the rise of Java as a client-side platform. And then when that didn’t work out, JavaScript ascended. Now the browser is the lowest-level abstraction that many developers have to think about.

Provided the IoT sector finds its footing, this ever-present march towards higher-level abstractions will inevitably paper over many of the concerns of the underlying hardware and network topology. But I think we’re safe in assuming that for the next handful of years, the trend towards ubiquitous computing will rely on low-power hardware platforms that communicate under variable networking conditions.

Fig. 4. Photo of Margaret Hamilton working on an IoT product.

In a small way, these adverse characteristics harken back to the good ol’ days of programming at places like MIT’s AI Lab in the 60s or Bell Labs in the 70s. Working with resource-limited environments requires a fundamental understanding of how computers work. In all, this makes the space an enriching one to work in.

Vinyl records are resurgent and more people are having to care how many bytes their programs occupy in memory. Everything old is new again.

Conclusion

Throughout the lessons discussed above (the importance of protocols and buffering strategies, conservatism with respect to dependencies, and the value of the tried and true ways of doing things), an overarching theme emerges—there isn’t a “default path” one can take when building software for a connected device.

For many companies, there’s a well-trod path of battle-tested solutions to most problems that arise. Need a webapp? Use Rails or React. Website slow? Use Memcached or Redis. Deploys slow? Use Docker, I guess.

In the IoT world, you’re in less-charted territory. Your problems are more apt to be something fundamental, like how your device is using the network or how the time scales of operation align between devices or how your device is translating physical stimuli into a stream of numbers.

I, for one, relish these kinds of problems that require you to sink your teeth in. And I hope we as an industry use them as an opportunity to grow.

Thanks to Fikreab Mulugeta for reading and providing feedback on drafts of this essay.

Aiming for sustainability

2017-03-21T00:00:00-07:00

My last post dealt with the subject of “systems thinking”, claiming that its fashionableness in the tech industry doesn’t jibe with the amount of collateral damage being racked up by prominent companies.

I wanted to follow up with a post that doesn’t spend 600 words tearing down other people’s work. To that end, I’d like to pose the question of what a company would look like if it were to truly espouse systems thinking.

To characterize a possible outline of such a company, I think that it comes down to developing a sustainable business model before being coerced by market forces to make the business work by any means necessary.

Sustainability

The word “sustainability” ceased to convey meaning years ago. It’s become more of a Boy Scout merit badge than an objective against which key results can be measured.

But to take a step back and define our terms, for a process to be sustainable, it must be capable of continuous operation without the aid of external forces. In the realm of business, the process to be sustained is growth, presumably of revenue. A firm can’t simply remain in operation to be considered successful—it must keep pace with (or beat) the greater economy.

This fact of business life provides the primary point of pressure on management—growth must be sustained, if not accelerated, or else it’s your head.

A company must overcome this pressure in order to have the luxury of even considering its adverse impacts on society. A management team kept up at night by fear of investor insurrection is not one that’s apt to fret about automation’s impact on the working class.

A company that fails to overcome this pressure is akin to a truck rolling downhill with its brakes cut. As more people pin their hopes and dreams to the company, management starts to lose political capital and control of their own destiny. As things get dicier, it starts looking preferable to steer the truck towards the crowded intersection in order to stay on the road instead of holding course towards the brick wall ahead.

A business earns the luxury of considering social impact by enacting a successful business model. A sustainable bottom line is the price of entry.

Luck

That sounds all well and good, but one does not simply “enact a successful business model”. This is a very difficult thing that most management teams fail to do.

Two contrived examples of how this may occur:

As a consequence of identifying a problem that your team is uniquely capable of solving, you achieve bottom line growth before investors lose patience.
You raise money from folks that trust you deeply, and this trust allows you to invest early capital in experimentation in pursuit of a sustainable business model.

Both of these scenarios involve quite a bit of luck. But this luck is earned, either by the team’s execution on a suitable problem or by building and leveraging trusted relationships.

Alignment

To me, the crux of this process is wrapped up in the motivations of a founding team. A team motivated by the fashionableness of starting a company as a get-rich-quick scheme is not one that’s likely to be mindful of negative externalities.

I contend that you’re better off executing on a problem that you find truly meaningful and likely to have a positive impact on society. These characteristics do not supersede the requirement of making money, but can be thought of as filters on the paths down which a company may traverse.

Starting a company is hard work. Minimizing a company’s exploitative fallout on society requires patience, some amount of privilege, and fundamental alignment among the founding team.

Thanks to Oren Schetrit for reading and providing feedback on drafts of this essay.

Systems Thinkpiece

2017-02-13T00:00:00-08:00

Luminaries in the technology industry love to trumpet systems thinking. A “systems thinker” is one who possesses the ability to reason holistically about the relationships between independent components of a system. This notion is recursive in the sense that each component may itself be a system composed of interconnected parts.

In practice, companies tout systems thinking as a means towards the nebulous ideal of “innovation”. We like to hire “systems thinkers” because our companies are complex, special organizations that need to be properly cared for and fed by smart people. But really, I think that true systems thinking manifests elsewhere in a corporate setting. An enlightened company thinks of ways to minimize their adverse impact on customers’ lives and society as a whole. Such a company may disrupt their competition, but leave society intact, if not improved.

Two examples, admittedly focusing on the negative because it’s easier to call attention to:

The fast food industry is bad at systems thinking. Their success is inversely correlated with the physical health of their customers.
The financial industry has a centuries-long track record of being bad at systems thinking. Time and time again, recessions are caused by the over-exploitation of markets and people. In terms of systems, expressions of laissez faire capitalism show a contemptible disregard for the consequences of exploitative actions on powerless people.

Tech

How does the technology industry fare when viewed through this lens? To frame the current moment in history, I think technologists’ collective vision of the future has come into stark contrast with reality. Look at basically any peppy video advertisement from a startup and you’ll see a world inhabited solely by affluent, tech-savvy people. The primary motivators when these people make purchasing decisions are the degree to which a product or service can make them feel “connected” and maximally productive.

This vision, one in which Maslow’s hierarchy itself has been disrupted, isn’t a vision that represents the bulk of humanity. In its dominant guise driven by venture capitalists, I don’t think Silicon Valley is as good at thinking about systems as it thinks it is.

Specifically, the current iteration of the tech industry fails to take into account and plan for the societal impact of its offerings. We want a driverless future, but it’s not our job to think about how we’ll maintain the livelihood and dignity of taxi-, truck-, and bus-drivers. We want an open and universal communications network, but refuse to acknowledge our consequent role in defending the truth and policing hate speech. We want to institute a sharing/gig economy, but give zero thought to how this would impact the health care system or how people find time to raise children amid a workweek full of discontinuous tasks carried out for rich people.

Closing

When the financial industry’s unchecked expansion into predatory lending cratered the global economy, we were up in arms, calling for heads to roll. Now, by giving a megaphone to populists, social media has helped put an authoritarian administration in the White House and the specter of white supremacy back into the mainstream. What is to be done when the leadership at Facebook and Twitter fails to even acknowledge their platforms’ inadvertent role in these developments?*

We can’t build an equitable future when the powers that drive the narrative (namely, prominent executives and VCs) fail to recognize technology’s adverse effects. When will Silicon Valley have its Oppenheimer moment?

Thanks to Marc Hedlund for reading and providing feedback on drafts of this essay.

* I realize that this is reductive. Also, the massive outpouring of activism since the election would be impossible without social media, so the megaphone goes both ways.

Introducing JSON Toggle

2017-02-07T00:00:00-08:00

This post proposes JSON Toggle, a JSON document structure for specifying feature toggles. This format is being used at Whisker Labs with a Java 8 library which we’re open-sourcing as a proof-of-concept for JSON Toggle.

Background

Feature Toggles (or feature flags) are a programming mechanism used to dynamically configure running programs. A team utilizing feature toggles can modify a software system’s behavior without having to redeploy code or restart processes with new configuration. When used judiciously, feature toggles can dramatically increase a team’s rate of experimentation and delivery.

The state of a collection of feature toggles is typically read at runtime from some side channel such as a database or a config file. This state can be updated out-of-band with tooling (e.g. a dashboard) and automatically distributed to running applications. At Twitter (where I used to work), the company-wide feature flag system provided the software equivalent of a railroad switching station, giving teams dynamic control over which codepaths were enabled for which users. This level of flexibility unlocks the ability to roll out new features and perform large refactors with a minimum of pulled-out hair.

A protocol for defining feature toggles

Since their emergence in the late aughts, feature toggles have become a common pattern at internet companies and have made their way into the enterprise. Despite this, there has never been a successful effort to standardize the means by which toggles are configured. By and large, the state of the art has each organization building an implementation from scratch and maintaining a bespoke distributed CRUD app to manage them.

It’s shocking to me that there isn’t an equivalent of the statsd protocol for feature toggles. Regardless of the storage and distribution mechanism used, a protocol would at least allow us to converge on a standard for specifying toggle configuration.

With this in mind, I’d like to propose JSON Toggle, a feature toggle specification format that we’re starting to use at Whisker Labs.

JSON Toggle defines a JSON document structure for parameterizing a set of feature toggles. Such documents, called “toggle specifications”, may be used to enact weighted probabilities that determine whether or not a feature is enabled for a given request.

JSON Toggle is language-agnostic in the sense that ingestion libraries may be implemented in any programming language.

An example toggle spec

In the following example toggle specification, three toggles are defined:

"/feature/ab_test" which acts as a simple coin-flip with 50% probability.
"/feature/dogfood_widget" which can be used to guard a feature that is 100% accessible to employees but inaccessible to all other users.
"/feature/incremental_rollout" which grants access to all employees, but only 1% of non-employee users.

The spec looks like this:

[
  {
    "key": "/feature/ab_test",
    "value": 5000
  },
  {
    "key": "/feature/dogfood_widget",
    "filter": [{
      "type": "cohort",
      "target": "employee",
      "value": 10000
    }],
    "value": 0
  },
  {
    "key": "/feature/incremental_rollout",
    "filter": [{
      "type": "cohort",
      "target": "employee",
      "value": 10000
    }],
    "value": 100
  },
  ...
]

Or equivalently in YAML:

---
- key: "/feature/ab_test"
  value: 5000
- key: "/feature/dogfood_widget"
  filter:
  - type: cohort
    target: employee
    value: 10000
  value: 0
- key: "/feature/incremental_rollout"
  filter:
  - type: cohort
    target: employee
    value: 10000
  value: 100

For each specified toggle, an integer value from zero to 10,000 represents a probability (out of 10,000). These values can be thought of as weights for predicates (boolean-valued functions) that are used in application code to guard whether or not a feature is enabled for a given request.

Toggles can additionally specify filters which apply different probability values to certain types of requests. For example, a filter could be used to target a cohort of users, such as “employees” or “beta testers”.
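To illustrate how a basis-point value can act as a weight on a predicate, here is a minimal sketch. The class name and the bucketing scheme (hashing the toggle key together with a user ID) are assumptions made for the example. JSON Toggle itself does not prescribe how a library maps a request to a bucket.

```java
import java.util.Objects;
import java.util.function.Predicate;

/** Hypothetical predicate that enables a feature for roughly value/10,000 of user IDs. */
final class BasisPointToggle implements Predicate<Long> {
    static final int SCALE = 10_000;

    private final String key;
    private final int value; // probability out of 10,000

    BasisPointToggle(String key, int value) {
        if (value < 0 || value > SCALE) {
            throw new IllegalArgumentException("value must be between 0 and 10,000");
        }
        this.key = Objects.requireNonNull(key);
        this.value = value;
    }

    /**
     * Deterministically map (toggle key, user ID) to a bucket in [0, 10000).
     * Assumed scheme: a given user always lands in the same bucket for a toggle.
     */
    int bucket(long userId) {
        return Math.floorMod(Objects.hash(key, userId), SCALE);
    }

    /** Enabled when the user's bucket falls below the toggle's value. */
    @Override
    public boolean test(Long userId) {
        return bucket(userId) < value;
    }
}
```

With this scheme a value of 0 never passes, a value of 10,000 always passes, and a value of 5,000 passes for about half of all buckets. A production implementation would likely want a better-distributed hash than `Objects.hash`.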

Toggle Specification specification

An individual toggle definition is broken into three components:

A required key string which uniquely identifies the toggle within the toggle spec.
An optional filter property, defining a list of filters which define “special case” branches of the toggle.
- For instance, a cohort filter is used to match a toggle invocation with a cohort. A cohort target could identify a subset of a userbase (e.g. “employees”), an IP range of incoming requests (e.g. 67.174.128.0/24), or any other subdivision relevant to an application.
- Filters should be evaluated in the order that they appear in the toggle specification.
A value property, which sets the “base” value for cases when no filter applies to a request. Values are specified in basis points and define toggle probabilities out of 10,000. Thus a toggle value of 5,000 will result in a toggle probability of 50%.
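The three components above could be modeled along these lines. This is a minimal sketch with hypothetical class names. It handles only cohort filters and assumes that the first matching filter wins, which is one reading of "evaluated in order" and not something the format spells out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

/** A "cohort" filter: requests in the target cohort use this value instead of the base. */
final class CohortFilter {
    final String target;
    final int value; // basis points, out of 10,000

    CohortFilter(String target, int value) {
        this.target = target;
        this.value = value;
    }
}

/** One toggle definition: a key, an ordered list of filters, and a base value. */
final class ToggleDefinition {
    final String key;
    final List<CohortFilter> filters; // may be empty, since filters are optional
    final int value;                  // base value in basis points

    ToggleDefinition(String key, List<CohortFilter> filters, int value) {
        this.key = key;
        this.filters = Collections.unmodifiableList(new ArrayList<>(filters));
        this.value = value;
    }

    /**
     * Pick the basis-point value that applies to a request belonging to the given cohorts.
     * Filters are checked in specification order; the first match wins (an assumption).
     */
    int effectiveValue(Set<String> requestCohorts) {
        for (CohortFilter filter : filters) {
            if (requestCohorts.contains(filter.target)) {
                return filter.value;
            }
        }
        return value; // no filter applied, so fall back to the base value
    }

    /** The effective value expressed as a probability between 0.0 and 1.0. */
    double effectiveProbability(Set<String> requestCohorts) {
        return effectiveValue(requestCohorts) / 10_000.0;
    }
}
```

Applied to the example spec, "/feature/incremental_rollout" would yield 10,000 (100%) for a request in the "employee" cohort and 100 (1%) for everyone else.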

A Java 8 library for working with JSON Toggle

In addition to this protocol, I’d like to share an early version of a JSON Toggle ingestion library that we’re using at Whisker Labs for our Java services. toggle is a Java 8 library which implements the functionality described above using java.util.function primitives. It supports toggle specifications stored in Amazon DynamoDB tables or JSON/YAML files, and offers a caching decorator powered by Caffeine.

With this library, we can do things like this:

// Construct a caching `ToggleMap` backed by a DynamoDB table.
Table dynamoDbTable = dynamoDbClient.getTable("production-toggles");

ToggleMap<String, Integer> toggleMap = new CachingToggleMap<>(
  new DynamoDbToggleMap<Integer>(dynamoDbTable),
  "maximumSize=1000,expireAfterWrite=1m"
);

// Create a toggle backed by the "/feature/new_hotness" definition.
Toggle<Integer> fancyNewFeature = toggleMap.apply("/feature/new_hotness");

// Use the toggle to guard some new functionality, based on a user ID.
if (fancyNewFeature.test(user.userId)) {
  // New hotness.
} else {
  // Old and busted.
}

Nomenclature unabashedly cribbed from Finagle.

Now what?

If JSON Toggle is to become a thing, we’ll need to write ingestion libraries in other programming languages and various tools to make it easy to manage toggle specifications.

If you think this approach could be useful or that it’s a stupid idea, I’d love to hear from you. In lieu of an official channel like a mailing list, for the time being, please reach out via email or by filing an issue on the toggle project on GitHub.

Thanks to Rishi Ishairzay for reading and providing feedback on drafts of this essay.

Synthesis over invention

2016-03-31T00:00:00-07:00

I recently finished a great book called Hilbert [1], an eponymous biography of the mathematician David Hilbert. It’s a fun read and conveys an important message for anyone who strives to be creative in a technical field. In contrast to the stereotype of the inventive genius who advances a field by creating fundamentally new concepts, many of the innovations to which Hilbert’s name is attributed are instead examples of synthesis, or the process of combining different existing theories into a single system.

This is a critical distinction which I don’t feel is emphasized enough in the contemporary software development and computer science community.

Nerd nostalgia

Beyond capturing the life of an intellectually preeminent man, author Constance Reid does a fabulous job of snapshotting the energy and unprecedented output of the mathematical society centered around Göttingen, Germany at the turn of the twentieth century.

If you were to trace the lineage of virtually any subfield of modern mathematics or physics, Göttingen would stand out as an inexorable nexus of theory and innovation. Any subset of the list of folks who passed through would be an intellectual hall of fame: Gauss, Riemann, Dirichlet, Born, Oppenheimer, Teller, Dirac, Planck, Einstein, Noether, Klein, Schopenhauer, Fermi, von Neumann, Heisenberg.

Up there with the Manhattan Project and Bell Labs in the 70s, Göttingen is one of the storied valhallas of nerd nostalgia. Every few decades since the industrial revolution (and before, but with far lesser frequency), some cloistered organization attracts a critical mass of productive brain power and is canonized in history books as a hotbed of R&D. The Georg August University of Göttingen was one such place from its heyday in the mid-19th century up until the German braindrain of the 1930s. Add to this the era of geopolitical turbulence overlapping Hilbert’s lifetime, and you’re left with a fascinating slice of history.

A synthetic character

In the wake of this book, I’m left ruminating over certain characterizations of Hilbert’s work and the historical setting in which he lived. One particularly compelling specialty that Hilbert possessed was an ability to synthesize disparate concepts into cohesive theories that at once fundamentally broke ground and unified different fields.

Hilbert’s student and fellow mathematician, Otto Blumenthal:

“For the analysis of a great mathematical talent, one has to differentiate between the ability to create new concepts and the gift of sensing the depth of connections and simplifying fundamentals. Hilbert's greatness consists of his overpowering, deep-penetrating insight. All of his works contain examples from far-flung fields, the inner relatedness of which and the connection with the problem at hand only he had been able to discern; from all these the synthesis — and his work of art — was ultimately created.”
C. Reid, Hilbert, 1st ed. New York: Copernicus, 1996, ch. 24, p. 208.

To paraphrase, Hilbert is best known not for inventing lots of fundamentally new science, but for finding and leveraging commonalities between the many good things that his contemporaries came up with. Rather than endlessly adding to an intractable pile of mathematical novelty, Hilbert excelled at synthesis, or the process of combining different theories into a single cohesive system.

In an ambitious age of unbounded possibility and imagination, what better strategy than to seek depth of connections and simplifying fundamentals.

Wherein software eats my book report

This notion has important and obvious lessons for software developers. The existential balance between novelty and synthesis in theoretical mathematics is equivalent to maintaining a balanced diet of abstraction in a computational system.

For example, I would go so far as to say that there is a trend in certain niche programming language communities to culturally appropriate as much from the fields of modern mathematics as possible. This has several adverse consequences, most immediate of which is the imposition of an immense educational burden on newcomers. Longer term, when projects are built around arcane abstractions that few people truly understand, they are inevitably used “improperly” because consumers don’t know any better.

I think that the degree to which a project exposes overly-conceptual abstraction is directly related to the rate at which technical debt accumulates in consuming code. Such cases do all of us a disservice and lead to the eventual marginalization of the abstraction-laden technology.

To me, it’s interesting to think about where the pendulum of creation vs synthesis currently lies in the world of software development. We are awash in heady topics that we collectively feel remiss for not understanding well enough. From elliptic curve cryptography to blockchains to consensus protocols to region-based memory management to whatever the hell a monad actually is, there is an increasing number of gaps in our domain expertise and few unifying systems through which to understand them.

I get the feeling that lately we as an industry put too much emphasis on novelty and not enough on unification. We are easily titillated by the newest programming language or the latest gizmo that can fork a process on a faraway computer and less interested in ideas that fundamentally simplify how we reason about and work with computers.

Towards synthesis

I’m humbled by Hilbert’s definitive ability to consolidate and simplify. Read by a software developer, the book is a call to action pitting the goal of synthesis against our tendency to run as fast as we can in a million directions at once.

By emulating David Hilbert and studying disparate fields with an eye towards unification, we can produce powerful tools without requiring that our users read 40 whitepapers before understanding anything.

Thanks to Rishi Ishairzay, Marcel Molina, and Arya Asemanfar for reading and providing feedback on drafts of this essay.

References

[1] C. Reid, Hilbert, 1st ed. New York: Copernicus, 1996.

Adaptive compression of periodic signals

2016-03-02T00:00:00-08:00

What do all compression algorithms have in common?

They usually involve some fancy math, but when it comes down to it, their defining tactic is to exploit properties of their input in order to encode the data in fewer bits. Thus a given compression algorithm is typically best applied to a specific type of data. Inversely, given a type of data, the process of selecting an appropriate compression algorithm requires one to precisely identify an exploitable characteristic of the data.

In this article, I’ll introduce a data-compression problem that we faced at Whisker Labs, chart a course through several fields of research, and describe a technique that my colleagues and I developed to address the problem.

An introductory example

Let’s start with an example. Imagine that you’re tasked with building a device which must measure the electrical current going through a 60 Hz AC circuit. The device uses a sensing element to measure current, produces a time series of readings, and then transmits these data to a server for offline processing.

Because alternating current is defined as a sinusoidal function of time, instantaneous measurements are meaningless in isolation. That is to say, since the value of current oscillates periodically, no individual sample will provide a meaningful representation of the dataset as a whole. At best it provides you with a snapshot of the signal’s amplitude at some arbitrary time. To arrive at useful current measurements, you have to capture the full current waveform and then typically compute a quadratic mean to arrive at a numerical result in Amperes.

Fig. 1. A single period of a simple sine wave, y(t) = sin(2πt).
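To make the quadratic mean concrete, here is a minimal Python sketch. The peak amplitude of 10, the 200 samples per cycle, and the function names are illustrative assumptions rather than values from our sensor.

```python
import math

def rms(samples):
    """Quadratic mean: square root of the mean of the squared samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def sine_cycle(amplitude, n_samples):
    """One full period of a sine wave at n_samples evenly spaced points."""
    return [amplitude * math.sin(2 * math.pi * k / n_samples)
            for k in range(n_samples)]

cycle = sine_cycle(amplitude=10.0, n_samples=200)
print(cycle[0])    # one instantaneous sample says little on its own
print(rms(cycle))  # close to 10 / sqrt(2), roughly 7.07
```

For a pure sinusoid the result is the peak amplitude divided by √2, which is why a single number can summarize a whole cycle where a single sample cannot.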

In order to produce data that characterizes the 60 Hz current waveform, the sensor has to sample the circuit at least 120 times per second. In practice, much higher sample rates are required because electrical current can fluctuate on time scales much shorter than one second. Given a sample rate of 1 kHz, if each datapoint is a 16-bit integer, our sensor entails bandwidth of at least 16,000 bits/second or just under 2 kilobytes/second. This may not sound like a lot in the modern era of Big Data™ until one considers the constrained networking environment in which such sensing devices typically operate. For example, an electric utility may deploy these devices on ZigBee networks, which have maximum data rates necessarily measured in kbps. Or the utility may splurge for access to a cellular data network, in which case our sensor’s requirement of roughly 5 gigabytes (about 40 gigabits) of data per month becomes exorbitantly expensive at any meaningful deployment scale.

So we’ve got ourselves an objective based on a constraint: to reduce the bandwidth utilization of our hypothetical sensor.

Characterizing the data

By definition, data compression involves taking advantage of certain properties of a dataset in order to represent it with fewer bits of information. A simple example is the application of run-length encoding (RLE) to a set of small positive integers, which can result in the elimination of the integers’ leading zeros (in two’s complement).
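As a rough illustration of the RLE idea, the sketch below run-length encodes the 16-bit binary representation of a small positive integer. The helper names are my own, and a real encoder would pack the runs into bits rather than Python tuples.

```python
from itertools import groupby

def rle_encode(bits):
    """Collapse each run of identical symbols into a (symbol, run_length) pair."""
    return [(symbol, len(list(run))) for symbol, run in groupby(bits)]

def rle_decode(pairs):
    """Expand (symbol, run_length) pairs back into the original string."""
    return "".join(symbol * count for symbol, count in pairs)

value = format(5, "016b")  # '0000000000000101'
encoded = rle_encode(value)
print(encoded)             # [('0', 13), ('1', 1), ('0', 1), ('1', 1)]
assert rle_decode(encoded) == value
```

The thirteen leading zeros collapse into a single pair, which is where the savings come from for small values.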

To apply this strategy to our sensor’s output, we first need to identify patterns in the data and then figure out how we can exploit those patterns. The most ripe characteristic of our dataset has already been mentioned – the fact that AC circuits produce sinusoidal data. For a constant current, the sampled data would form a simple sinusoid (the most basic example thereof is illustrated in Fig. 1). Such a signal could be encoded in only two quantities: the waveform’s amplitude and phase angle. Note that the frequency parameters of the wave equation are fixed for a 60 Hz circuit.

But of course our example is not that simple. As in electric utilities’ use cases, we need to be able to monitor realistic circuits, such as those of a home, commercial building, or data center. These types of circuits entail variable electrical loads, which produce data that are still fundamentally sinusoidal, but are much messier (see Figure 2).

Fig. 2. An example current waveform produced by a residential electrical load during a state transition. Y axis units are omitted because the sensor output is technically unitless. The amplitude is proportional to current, but must go through a calibration operation to produce measurements in amps.

It turns out that there’s an existing body of research related to signals like these. Whereas vanilla composite waveforms (e.g. Figure 1) are periodic in the sense that they repeat over and over forever, a signal can be termed pseudo-periodic if it can be subdivided into discrete segments of periodicity [2]. An example pointed out in [2] of a pseudo-periodic time series is a heart-beat:

A data set often will exhibit great regularity without exactly repeating. For example, heartbeats always have the characteristic “lub-dub” pattern which occurs again and again, yet each recurrence differs slightly from each other. Some beats are faster, some slower, some are stronger and some weaker. Sometimes a beat may be “skipped”. Nonetheless, the overriding regularity of the heartbeat is its most striking feature.
William A. Sethares, Repetition and Pseudo-periodicity

Interestingly, this property also applies to the audio waveforms of music, leading to applications in compression and rhythm analysis [3]. For the purposes of this article, time series of electrical current measurements also match this definition. This is apparent by visual inspection of Figure 2, wherein the signal can be divided into three distinct regions:

Time range	Characteristics
0 - ~220 ms	Low-amplitude, periodic
~220 - ~440 ms	Transient, non-periodic
~440+ ms	High-amplitude, periodic

Analogously to how one’s heart-rate fluctuates throughout the day during periods of excitement or lethargy, the electrical current going through a circuit fluctuates as connected devices turn on and off. For our hypothetical sensor, this manifests as a mostly-repeating sequence of data, interjected by brief periods of perturbation as the signal shifts to a new pattern.

Leveraging pseudo-periodicity

So we have fancy terminology to describe our data, so what?

Let’s take a step back and consider how a fully periodic time series could be encoded compactly. By definition, the data repeats itself over and over for the duration of the dataset. A variant of run-length encoding could be used, where instead of eliminating sequences of repeated zeros or ones, entire bit sequences would be on the chopping block. Put another way, a periodic dataset presents a similar opportunity to that presented by the collection of positive integers mentioned above, but differs in the cardinality of the repeated pattern.

This principle applies in kind to pseudo-periodic data, provided we can identify subsequences that are sufficiently periodic. Given a time series that can be segmented into locally-periodic regions, we can encode the periodic parts with a single cycle of the data and an integer number of cycles over which the cycle repeats. This would be analogous to deflating a run into a single value and a count in RLE.

This means that for arbitrarily-long sequences of steady-state current readings, all we need to convey is a single 60 Hz cycle’s worth of data and an integer indicating how long the cycle repeats. The efficacy of this technique scales linearly with the duration over which it’s applied – deflating a second’s-worth of steady-state data results in a compression ratio of 60:1, 2 seconds produces 120:1, etc.
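A minimal sketch of this cycle-plus-count encoding follows. It assumes the data repeats exactly and that a cycle spans a whole number of samples, neither of which holds for real sensor data; the function names are hypothetical.

```python
def deflate_periodic(samples, cycle_len):
    """Return (cycle, repeats) if samples exactly repeat their first cycle, else None."""
    if cycle_len <= 0 or len(samples) % cycle_len != 0:
        return None
    cycle = samples[:cycle_len]
    repeats = len(samples) // cycle_len
    if cycle * repeats != samples:
        return None
    return cycle, repeats

def inflate_periodic(cycle, repeats):
    """Rebuild the original sequence from one cycle and a repeat count."""
    return cycle * repeats

one_cycle = [0, 7, 0, -7]
steady = one_cycle * 60            # sixty identical cycles
cycle, repeats = deflate_periodic(steady, cycle_len=4)
print(len(steady) // len(cycle))   # 60, i.e. a 60:1 reduction in samples
assert inflate_periodic(cycle, repeats) == steady
```

The ratio printed here ignores the few bits needed for the repeat count itself.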

Algorithm formulation

In order to make use of this compression technique, we need an automated way to determine that a sequence of data is periodic. This property is often easy to pick out visually, just as the regions of periodicity are apparent in Figure 2. If we divide a periodic region into its constituent cycles and overlay them on top of each other, it’s clear that the cycles have the same shape:

Fig. 3. Forty cycles of electrical current samples from a region of periodicity overlaid atop each other. The inter-cycle spread is largely due to random noise imposed by imperfections in the hardware sensing elements.

In order to implement the compression technique described above, an algorithm is needed to detect periodicity. Such an algorithm could be deployed to operate on buffered segments of these time series data, resulting in streaming compression well-suited to sensor output.

Academic interlude

Methods for the detection and characterization of pseudo-periodicity have been a relatively popular sub-field in academic literature since the mid-aughts [5, 6]. The publications we surveyed share an overarching goal of developing an algorithm to determine the precise mathematical description of a signal’s periodic regions. That goal is far more general than ours, which was simply to detect periodicity. However, there were of course fruitful commonalities, including amplitude mismatch [5] between cycles within a periodic region (which is relevant to any imperfect sensor which measures things in the real world) and the strategy of computing a correlation metric by comparing every ith value in a set of cycles [6].

Quantifying deviance

Rather than diagnosing a signal’s precise parameters and template function, all we want to do is detect whether or not a time series is periodic. We could then buffer sensor data for some length of time and apply our detection function on the buffered windows of data. If the function returns true, then we know that the window of data can be reasonably represented as a single cycle, and thus compressed down by a significant factor.

It is worth noting that our compression technique is lossy. As illustrated by the spread on the y-axis of the lines plotted in Figure 3, cycles’ amplitudes don’t exactly match even in periodic regions. Thus we don’t maintain the full raw data when encoding a long sequence as a single representative cycle. However, in practice we don’t actually lose useful information because the amplitude mismatch can largely be chalked up to minor random noise imposed by imperfections in the sensors’ physical components. In effect, our technique has the added benefit of smoothing the sensor data, if anything.

Our intuition was to apply a standard sum of squares measure of variance to a representative sample of sensor data in order to select a good candidate to serve as the trigger for our algorithm. We first played with a residual sum of squares and ended up choosing root-mean-square deviation because its results are closer in terms of scale to the input data. That is to say, the RSS grows quadratically as the deviations from the estimate increase, whereas the RMSD grows linearly.
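The difference in scaling is easy to check numerically. In this sketch the estimate and noise values are made up for illustration; doubling every deviation quadruples the RSS but only doubles the RMSD.

```python
import math

def rss(observed, estimate):
    """Residual sum of squares between observations and an estimate."""
    return sum((o - e) ** 2 for o, e in zip(observed, estimate))

def rmsd(observed, estimate):
    """Root-mean-square deviation, in the same units as the data."""
    return math.sqrt(rss(observed, estimate) / len(observed))

estimate = [0.0, 1.0, 0.0, -1.0]
noise = [0.1, -0.2, 0.1, 0.2]
small = [e + n for e, n in zip(estimate, noise)]
large = [e + 2 * n for e, n in zip(estimate, noise)]  # every deviation doubled

print(rss(large, estimate) / rss(small, estimate))    # close to 4
print(rmsd(large, estimate) / rmsd(small, estimate))  # close to 2
```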

With a measure of variance in hand, this leads us into a precise definition of our algorithm.

The algorithm

At a high level, our streaming algorithm will take as input a chunk of buffered time series data, determine whether or not it is periodic, and if so, return a single-cycle representation of the data. Under the hood, we compute the root-mean-square deviation to assess the periodicity of the input.

Given n seconds’ worth of buffered time series data with a known cycle frequency, our “adaptive averaging” algorithm involves three steps:

1. Averaging: Compute a cycle average by averaging the corresponding samples of all of the buffered cycles. Put formally, for a set of cycles each comprising m data points, for each i from 0 to m − 1 compute the average of the ith data points of every cycle.
2. RMSD: Taking the cycle average computed in step 1 as the estimator, compute its root-mean-square deviation with respect to the raw cycles themselves.
3. Threshold-comparison: Compare the computed RMSD against a predefined threshold to produce a boolean value indicating whether or not the cycle average is sufficiently representative of the data.

The RMSD can be thought of as an error metric for assessing the cycles’ closeness to one another. If this error metric exceeds the threshold, then we cannot reasonably eliminate cycles. If the error metric is lower than the threshold, then we can consider the cycle average to be “close enough” to each raw cycle.
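The three steps can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not our production code: it assumes each cycle spans a whole number of samples (cycle_len), drops any partial cycle at the end of the buffer, and uses made-up threshold and sample values.

```python
import math

def split_cycles(samples, cycle_len):
    """Chop a buffer into whole cycles of cycle_len samples, dropping any partial tail."""
    n_cycles = len(samples) // cycle_len
    return [samples[i * cycle_len:(i + 1) * cycle_len] for i in range(n_cycles)]

def cycle_average(cycles):
    """Step 1: average the ith sample of every cycle."""
    return [sum(column) / len(cycles) for column in zip(*cycles)]

def cycle_rmsd(cycles, average):
    """Step 2: RMSD of the raw cycles, with the cycle average as the estimator."""
    squared = [(x - a) ** 2 for cycle in cycles for x, a in zip(cycle, average)]
    return math.sqrt(sum(squared) / len(squared))

def adaptive_average(samples, cycle_len, threshold):
    """Step 3: return (is_periodic, average, rmsd) for one buffered window."""
    cycles = split_cycles(samples, cycle_len)
    if not cycles:
        raise ValueError("buffer must contain at least one full cycle")
    average = cycle_average(cycles)
    error = cycle_rmsd(cycles, average)
    return error < threshold, average, error

steady = [0, 5, 0, -5] * 10
transient = [0, 5, 0, -5] * 5 + [0, 20, 0, -20] * 5
print(adaptive_average(steady, cycle_len=4, threshold=1.0)[0])     # True
print(adaptive_average(transient, cycle_len=4, threshold=1.0)[0])  # False
```

Choosing the threshold is the interesting part in practice, since it has to sit above the sensor's noise floor but below the deviation caused by a real change in load.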

An example is shown in Figure 4. Here we’ve applied the adaptive averaging algorithm with a one-second window size to a sequence of data surrounding that shown in Figure 2. The steady-state cases result in low RMSD values whereas during transient periods of change, the RMSD spikes considerably.

Fig. 4. A ten-second snapshot of pseudo-periodic time series data and the RMSD values produced by the adaptive averaging algorithm with a one-second window size.

Example and results

Bringing this back to our sensor case study, in the steady-state case, we can compress the time series of current measurements down to a cycle average and an integer indicating the number of repetitions that the time series covers. The algorithm is adaptive in the sense that it adapts to the degree of periodicity in the data. This manifests in high compression during steady-state and bursts of lossless data during intervals of change. So we get low data size when current is steady, and then when a fridge or an HVAC system kicks on, we get a brief spike before the current levels off at a new steady-state.
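Here is a hedged sketch of what that adaptive behavior might look like when applied window by window. The record format ("avg" versus "raw"), the window sizes, and the threshold are assumptions made for illustration, and the window length is assumed to be a multiple of the cycle length.

```python
import math

def encode_window(window, cycle_len, threshold):
    """Encode one window as ('avg', cycle, repeats) or ('raw', samples)."""
    n_cycles = len(window) // cycle_len
    cycles = [window[i * cycle_len:(i + 1) * cycle_len] for i in range(n_cycles)]
    average = [sum(col) / n_cycles for col in zip(*cycles)]
    squared = [(x - a) ** 2 for c in cycles for x, a in zip(c, average)]
    rmsd = math.sqrt(sum(squared) / len(squared))
    if rmsd < threshold:
        return ("avg", average, n_cycles)  # steady state: one cycle plus a count
    return ("raw", list(window))           # transient: pass the samples through

def encode_stream(samples, window_len, cycle_len, threshold):
    """Yield one record per full window of the stream."""
    for start in range(0, len(samples) - window_len + 1, window_len):
        yield encode_window(samples[start:start + window_len], cycle_len, threshold)

low = [0, 1, 0, -1] * 3
change = [0, 1, 0, -1] + [0, 3, 0, -3] + [0, 5, 0, -5]
high = [0, 5, 0, -5] * 3
stream = low + change + high

records = list(encode_stream(stream, window_len=12, cycle_len=4, threshold=0.5))
print([record[0] for record in records])  # ['avg', 'raw', 'avg']
```

The middle window, where the load changes, is emitted untouched, while the windows on either side shrink to a single cycle each.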

The compression ratio scales linearly with the duration over which data is buffered. The following table shows expected steady-state compression ratios for varying window sizes in terms of the measured quantity’s period, T:

Window size, in multiples of T	Compression ratio
1	1:1
10	10:1
n	n:1

Thus, if we were to use a one second window size for our sensor measuring the current of a 60 Hz circuit, we should observe a compression ratio of 60:1 in steady-state cases. Compared to our original bandwidth of 16,000 bps, our sensor would emit data at an average rate under 300 bps. This translates to less than one GB/month of total bandwidth, making the aforementioned ZigBee or cellular communication use cases much more feasible.
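The arithmetic behind these figures is short enough to write out. This sketch assumes a 30-day month and ignores the handful of bits needed to transmit the repeat count.

```python
SAMPLE_RATE_HZ = 1_000
BITS_PER_SAMPLE = 16
LINE_FREQUENCY_HZ = 60
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # assumes a 30-day month

def steady_state_bps(window_seconds):
    """Average bit rate when every window collapses to a single cycle."""
    raw_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
    cycles_per_window = LINE_FREQUENCY_HZ * window_seconds
    return raw_bps / cycles_per_window

bps = steady_state_bps(window_seconds=1)
megabytes_per_month = bps * SECONDS_PER_MONTH / 8 / 1e6
print(round(bps))                  # 267
print(round(megabytes_per_month))  # 86
```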

Other investigated areas of research

Along the way towards coming up with the cycle-averaging + RMSD idea, we investigated a number of areas of research not mentioned thus far. While not as applicable in the end, they made for very interesting reading and drove home the point that there’s more than one way to skin a cat.

From our stream of raw samples, we could feasibly compute a fast Fourier transform to decompose the signal into a set of complex numbers or (amplitude, phase) tuples for each of the harmonics larger than some threshold. On the server, we could then plug these wave equation coefficients into sin functions and be done with it. This is a good option (and one we may eventually implement), but we’ve thus far found adaptive averaging to be Good Enough for our immediate data size requirements.
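For the curious, the Fourier approach might look something like the sketch below. To stay dependency-free it uses a naive discrete Fourier transform rather than a true FFT, it operates on a single cycle, and it skips the DC and Nyquist terms. The test signal, threshold, and function names are illustrative assumptions.

```python
import cmath
import math

def harmonics(cycle, min_amplitude):
    """Return {k: (amplitude, phase)} for harmonics of one cycle above min_amplitude."""
    n = len(cycle)
    found = {}
    for k in range(1, n // 2):  # skip DC (k = 0) and the Nyquist term
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(cycle))
        amplitude = 2 * abs(coeff) / n
        if amplitude >= min_amplitude:
            found[k] = (amplitude, cmath.phase(coeff))
    return found

def reconstruct(components, n):
    """Server side: rebuild n samples from the (amplitude, phase) tuples."""
    return [sum(a * math.cos(2 * math.pi * k * i / n + p)
                for k, (a, p) in components.items())
            for i in range(n)]

n = 64
signal = [10 * math.sin(2 * math.pi * i / n) + 2 * math.sin(2 * math.pi * 3 * i / n)
          for i in range(n)]
components = harmonics(signal, min_amplitude=0.5)
print(sorted(components))  # [1, 3]
rebuilt = reconstruct(components, n)
print(max(abs(a - b) for a, b in zip(signal, rebuilt)))  # tiny floating-point error
```

The reconstruction uses cosines with a phase offset, which is equivalent to the sine form mentioned above up to a quarter-cycle shift in phase.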

Facebook’s recent paper on their in-memory time series database called Gorilla [1] contains an entire section on time series compression, focusing on delta-of-delta encoding for timestamps and an XOR encoding scheme for values. These weren’t found to be fruitful for our data, particularly because the techniques outlined in the paper rely on the fact that the measurements made by software monitoring systems don’t often fluctuate on small time scales. Our data is much higher-frequency and changes constantly. It was Gorilla’s use of delta-of-delta encoding however that led us down the path of investigating the idea of comparing ith data points across adjacent cycles.
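To give a flavor of the two ideas, here is a simplified sketch. It only computes the intermediate values (second-order timestamp differences and the XOR of consecutive floats); the paper's actual bit-level packing is not reproduced, and the sample data is invented.

```python
import struct

def delta_of_delta(timestamps):
    """Second-order differences; regular intervals produce runs of zeros."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_with_previous(values):
    """XOR each float's 64-bit pattern with its predecessor's; repeats give 0."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [a ^ b for a, b in zip(bits, bits[1:])]

print(delta_of_delta([0, 60, 120, 180, 241, 300]))  # [0, 0, 1, -2]
print(xor_with_previous([12.0, 12.0, 12.5]))        # first entry is 0
```

Both tricks pay off when consecutive values barely change, which is exactly the property our rapidly oscillating samples lack.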

We went down an indulgent path regarding discrete wavelet transforms, a class of functions similar to Fourier transforms that are often used in image compression. At the core of wavelet theory is the notion of decomposing a continuous signal into a discrete series of scaled basis functions. This is compelling, but fell into the same camp as [5] and [6] in seeking a much more complicated outcome than simply detecting a specific property of a time series. Wavelets are a rather impenetrable field of study, but we found [4] to be a reasonable summary.

Conclusion

As we’ve seen, it behooves the bandwidth-conscious to be aware of the patterns and properties of their data. By exploiting a property of our specific type of data called pseudo-periodicity, we were able to reduce the average-case size of our real-world sensor data by an order of magnitude.

Update: The compression technique described in this essay has since been deployed fleet-wide at Whisker Labs, resulting in a roughly 84% reduction in bandwidth and overall data size.

@evanm A fun morning: observing an 84% fleet-wide bandwidth reduction due to our adaptive averaging algorithm pic.twitter.com/j26HXUeqYo
— Evan Meagher (@evanm) March 10, 2016

Thanks to Steven Lanzisera, Wilhelm Bierbaum, and Johan Oskarsson for reading and providing feedback on drafts of this essay.

References

[1] T. Pelkonen, et al., "Gorilla: A Fast, Scalable, In-Memory Time Series Database," Proceedings of the VLDB Endowment, v.8, n.12, p.1816-1827, August 2015.
[2] W. A. Sethares, "Repetition and Pseudo-periodicity," Tatra Mountains Mathematical Publications, Publication 23, 2001.
[3] W. A. Sethares and T. W. Staley, "Meter and Periodicity in Musical Performance," Journal of New Music Research, August 2010.
[4] C. Valens, "A Really Friendly Guide to Wavelets," 1999.
[5] H. Wong and W. A. Sethares, "Estimation of Pseudo-periodic signals," Dept. of Electrical and Computer Engineering, University of Wisconsin-Madison, May 2004.
[6] M. Small and J. Zhang, "Detecting and describing pseudo-periodic dynamics from time series," Hong Kong Polytechnic University, August 2007.

Design documentation at small companies

2015-09-24T00:00:00-07:00

One component of the engineering culture at Twitter (where I used to work) that I’m trying to instill at my new job is the importance of writing design documents prior to implementing complicated systems. In this essay, I will argue in favor of premeditated software design at small companies and propose what I call “precautionary migration planning” as a design doc section that caters specifically to the tradeoffs required by startups.

Traveling by map

A design document is an outline of a proposed design for a software system in writing and figures. The level of detail and formality can vary, but the purpose is to force an engineer to think about and document what a system should do and how it should be built before effort is spent on implementation.

Many large companies enforce design docs for all new projects, going so far as to prescribe document templates and design review meetings. While such a formal approach makes sense when projects require coordinated effort across multiple teams and scores of people, it would be an inappropriate amount of overhead for an engineering team at a startup.

But the baby shouldn’t be thrown out with the bathwater. Writing down and examining your thoughts prior to acting on them is a good way to avoid mistakes and prevent unwarranted technical debt. As such, even at a startup, going through a semi-formal design exercise injects a healthy amount of peer-review into the process and can increase the reliability of the systems you end up with. Not to mention the added benefit of having a good understanding of a project’s scope and thorough high-level documentation prior to writing a single line of code. Ideally, when you bring new folks onto the team, you can simply link them to a set of design docs and save yourself an hour of whiteboarding.

A straightforward analogy helps illustrate when a design doc is appropriate for a new undertaking. A design doc is like a set of directions and a map. The complexity of a journey determines whether or not directions are required. For instance, you can walk up the road to the grocery store without thinking, so you obviously don’t need a map. Similarly, if all you need to do is add a simple feature or fix a simple bug, then a rigorous design process is probably unnecessary.

However, for trips venturing into unfamiliar territory or requiring multiple vehicles, coordinating travel with a set of directions is a must. Likewise, if a system at the core of the company’s business has many moving parts and will affect the lives of numerous people over its lifetime, then a design doc will probably prove to be worthwhile.

External dependencies

After writing the first couple of design documents at Whisker Labs, I’ve noticed a key difference between what I’m writing now and those I wrote at Twitter. Critically, the former tend to rely on the availability of services maintained by unfamiliar people at other companies rather than acquaintances down the hall. For instance, by making use of Amazon Web Services instead of technologies stewarded in-house, our services’ uptime is reliant on the diligence of anonymous Amazon personnel. As the swashbuckling systems cliché goes, you own your availability, but you aren’t in control of all of the factors from which it derives.

Strategies exist for managing the impact of intermittent outages of third-party services. RPC interactions can be augmented with features like retries and failure accrual, or can simply return partial results as a means to limit the damage caused by temporary downtime. But at a higher level, years of experience with as-a-service offerings have shown that there is typically a threshold scale beyond which any given hosted service ceases to be economical. What we’ve observed is that almost all companies who bootstrap their software atop whatever-as-a-service solutions eventually move away from them on account of cost, reliability, and/or functionality. In the long term, everybody ends up running their own Graphite and Kafka clusters and the luckiest of us get our own datacenters.

Not to mention the trend of services simply disappearing out from under you, on account of the originating company being acquired or otherwise going out of business.

But for a scrappy, bandwidth-constrained startup team, paying someone to do the heavy lifting of distributed systems operation is a no-brainer. So what does a responsible software engineer do in such cases when business and productivity concerns demand the usage of hosted services regardless of their long-term feasibility?

Precautionary migration planning

The easy (and industry-standard) answer is to throw up your hands and say “we’ll cross that bridge when we get there.” The pricing and long-term viability of external services is entirely out of your control, so why worry about hypothetical futures that you can’t influence? People still live in Seattle and Portland even though the mega-quake is coming, right?

This is a fine answer if you’ve made the conscious decision that your #1 priority as an engineering organization is speed of execution. Depending on your product or service’s reliability requirements, the pace of your market, and your bottom line, it very well may be preferable to put your time to more immediately productive use than planning for eventualities.

On the other hand, deciding which failure modes are worth planning for is part of what makes engineering interesting. The best you can do to minimize the risk imposed by external dependencies is to come up with a feasible (but brief) plan for migrating away from them. Consider it a precautionary principle for SaaS.

This is why I’m starting to bake such a section into the design docs that I’m writing. These sections follow the same principles of situational awareness and premeditated action that motivate having runbooks for services, but are more akin to a heart transplant than a simple runbook item. The sections will:

List the system’s external dependencies whose long-term feasibility is deemed at risk (e.g. “<PaaS> will be too expensive by the time we hit <milestone>”)
List potential replacements for the risky dependency and give a high-level plan for migrating

The result of this exercise is a better understanding of a system’s risk profile and the paths by which the system is likely to evolve over time.

Countering the logical conclusion

In response to my initial thoughts on this strategy on Twitter, an esteemed former colleague pointed out its logical conclusion, in which the list of “hosted services” is exhaustive. In literal terms, a program’s “external dependencies” include the operating system and proprietary hardware on which it runs, all the way down to the utility company that supplies the energy powering the computer. In this light, precautionary migration planning is absurd, given that the engineering effort involved in reinventing every wheel between your program and electrons in circuits is well beyond most companies’ capabilities.

However, I don’t think that this argument refutes the usefulness of such planning. When done pragmatically, focusing on a reasonable subset of a system’s dependencies, a team gains the ability to act quickly when migrations are deemed necessary.

One way to differentiate external dependencies is by whether or not they are truly fundamental to a service’s operation. If the power goes out, a program (or at least a stricken instance thereof) is unrecoverable regardless of any migration plan. Thus such planning is only relevant for partial failure modes, such as the loss of a hosted database or the end-of-LTS for a specific operating system version.

Conclusion

Even small ships carry maps. I’ve made a case for the use of design documents at startups, but a key takeaway is that their use varies from organization to organization. For some businesses, time spent planning for hypothetical futures is not time well spent. For others, it’s a valuable hedge against undesirable outcomes.

Experience has shown that once an engineering organization reaches a certain size, a reasonably-rigorous design process is well worth having in place. A startup team’s habits tend to ossify into company culture, which is motivation to start thinking about a team’s design process early. Even if you decide against design documentation in the early stage of your company, going through the mental exercise of considering its implications will increase your team’s operational awareness.

Thanks to Marcel Molina and Gary Tsang for reading and providing feedback on drafts of this essay.

Introducing Armsible

2015-07-13T00:00:00-07:00

Update: Since the publication of this article, Armsible projects have been folded into Whisker Labs’ GitHub organization.

Much ink has been spilled over the “Internet of Things”. A consequence of this trend is the rise of the single-board computer as a mainstream form factor for application development. With the popularity of open source¹ platforms like Raspberry Pi, Arduino, and BeagleBoard, it’s never been easier to build applications that encompass both hardware and software.

However, there is less publicly-available material on how to incorporate single-board computers into larger-scale deployments. A typical use case involves someone using an ARM computer to monitor or actuate devices in their home. The deployment workflow is more often than not akin to a Linux server administered manually through SSH sessions over the lifetime of the device. In contrast to the level of automation fetishized in the software operations community, the state of the art in the open source IoT space is remarkably unsophisticated.

In spirit, Armsible represents a call-to-action for the use of industry-standard provisioning tools and techniques in embedded applications². Specifically, it is a collection of Ansible roles and related tools that facilitate the automated deployment of single-board computers.

How do I use Armsible?

As of its unveiling, Armsible boils down to a few Ansible roles and a dynamic inventory script for targeting hosts on a local network. The initial use case is to provision a set of single-board computers on a LAN.
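For readers unfamiliar with dynamic inventories, the sketch below shows the general shape of such a script. It is not Armsible's actual script: the device addresses, the "boards" group name, and the hard-coded host list are all hypothetical, and a real script would discover hosts on the LAN instead. It follows Ansible's convention of answering --list with a JSON inventory and --host <name> with that host's variables.

```python
#!/usr/bin/env python3
import json
import sys

# Hypothetical devices; a real script would discover these on the local network.
DEVICES = {
    "192.168.1.50": {"ansible_user": "pi"},
    "192.168.1.51": {"ansible_user": "debian"},
}

def build_inventory(devices):
    """Build the JSON structure Ansible expects from `--list`."""
    return {
        "boards": {"hosts": sorted(devices)},
        "_meta": {"hostvars": devices},  # lets Ansible skip per-host calls
    }

def main(argv):
    if len(argv) == 2 and argv[0] == "--host":
        print(json.dumps(DEVICES.get(argv[1], {})))
    else:
        print(json.dumps(build_inventory(DEVICES)))

if __name__ == "__main__":
    main(sys.argv[1:])
```

Saved as an executable file, a script like this can be passed to ansible-playbook with the -i flag in place of a static inventory file.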

Armsible’s focused, albeit limited scope is a consequence of its intended use in concert with other roles from the Ansible community. A typical playbook for an embedded project will not be composed entirely of Armsible roles. Configuration management for standard components like DNS is a solved problem. Armsible fills the gaps between the needs of embedded applications and the existing suite of roles from the wider community.

To that end, we’d like Armsible to be the home for the following:

Roles for provisioning specific hardware platforms (e.g. Raspberry Pi, BeagleCore, Intel Edison)
Roles for installing and configuring software components that are needed by embedded developers but not currently covered by the open source Ansible community (e.g. the kernel watchdog, U-Boot, GPIO configuration)
Tooling that enforces best practices for embedded development

Why Ansible?

Ansible struck us as the right tool for the job because it is built around vanilla SSH connections. For embedded devices that run no-frills distributions of Linux, Ansible is much more applicable out of the box than other tools that rely on less-ubiquitous transport protocols and more-complicated topologies.

How is Armsible organized?

Armsible is structurally inspired by DebOps, a collection of Ansible playbooks for Debian-based server deployments. It comprises a number of Ansible roles stored as distinct repositories within Whisker Labs’ GitHub organization (originally an Armsible GitHub organization). These roles are published to Ansible Galaxy and thus installable on the command-line with ansible-galaxy. A bin project is provided to house complementary tools (i.e. dynamic inventory scripts) to be used in conjunction with Armsible roles.

What plans exist for Armsible’s future?

The project spawned from the hardware provisioning needs of products developed at Whisker Labs. As such, the project’s initial offerings are a sample of what we’ve developed so far and are thus limited to the technologies we use.

Part of the intention behind open-sourcing this work is to foster a community around IoT hardware provisioning. We encourage anyone working in this space to take a look at Armsible and help make it more useful. The best ways to get involved are by filing GitHub issues on individual projects or joining the conversation in #armsible on irc.freenode.net.

The technologies in question are "open source" to varying degrees, but vendors' overall inclination towards open source is helping push the hardware world in the right direction. For instance, the Arduino and BeagleBoard/BeagleBone device families benefit greatly from the tooling, documentation, and manufacturing ecosystem afforded by open hardware design. ↩
"Embedded" should really be in air quotes here, given that we're talking about machines that run Linux. At the risk of graybeards not taking me seriously, I'm going to roll with it. ↩

Coordinating technological change in large software organizations

2014-06-19T00:00:00-07:00

The topic of software scalability seems to bring out the armchair general in everybody. Much of the culture of the software industry is fueled by anecdotal war stories, blog posts, and “this one paper you should read”. We are all knee-deep in an unending stream of literature prescribing ways to achieve maximum computer performance, but the organizational consequences of hyper-growth get far fewer headlines. I would argue that these consequences have more of an impact on the daily lives of more developers than the scalability of code. The structure of a company can determine what you work on and who you do it with. Without widespread appreciation for the cost of coordinating technology changes across such a dispersed group of people, it’s hard to imagine any single employee not being impacted by wasted time and miscommunication.

A common tactic for scaling a software engineering organization is to compartmentalize teams around various components that collectively make up the company’s product. The development team may be split into Frontend Engineering and Backend Engineering. Each of these may be subdivided into focus areas, terminating in teams that cover specific sets of technologies. In this manner, a company’s team structure is modelled as a tree (conveniently similar to how its personnel fit into a tree-based org chart):

For instance, “Backend Engineering” may encompass any piece of technology deeper in the stack than user-facing clients, from analytics pipelines and application servers down to databases and operating systems. This model is especially well-suited for the development of service-oriented architectures, in which the components of a product’s backend are encapsulated in network services each maintained by small teams.

The burden of coordination

A consequence of this organizational complexity is an increase in the amount of coordination required to make progress. Given the subdivision into specialized teams, any work to improve the overall product will necessarily involve multiple teams. For example, the task of adding a recommendations widget may spawn work for the web, iOS, and Android client teams, the creation of a new batch job to be built and maintained by the analytics team, and a new API endpoint to be added by an application services team. The burden imposed by this need for top-down, product-oriented coordination is part of what motivates the widespread criticism of “big companies”. Implicit in the idea of being an early employee at a growing company is the ability to be directly involved in the product. As a workforce grows, the perceived ability of any individual to effect change diminishes. Compared to the freedom and breadth enjoyed by employees of short-staffed small businesses, making an impact within a larger organization may seem like more trouble than it’s worth. This sentiment often manifests in technology-driven companies leaving a trail of “startup people” in their wake who step away from the company once it’s survived the trial by fire of early-stage growth.

However, well-run large organizations benefit from the higher throughput afforded by a larger workforce to apply to problems. A great example of this on a grand scale is Apple, whose ability to “walk and chew gum at the same time” results in concurrent efforts to drastically reshape both their mobile and desktop offerings.

This covers the macro-level work that trickles down from high-level product decisions, but not the variety that stems from changes deep in the stack. Infrastructure work results in a separate class of communication overhead.

Bottom-up coordination

The often underestimated counterpart of this top-down coordination is the cost of the bottom-up coordination imposed on developers working on infrastructure. By “infrastructure” I mean any technology that is depended upon by other developers. In this context, infrastructural work would include library development, database administration, and service ownership. For these kinds of teams, making profound changes implies effort to coordinate with numerous teams. For example, before migrating to a new database or replacing a deprecated library, the initiating team will have to communicate with many others. These scenarios inevitably cause friction with other teams, whether by imposing unplanned work on them or simply adding the operational risk of deploying new code.

Part of what distinguishes great infrastructure teams is a sense of empathy for those that depend on them. When attempting to move an organization forward with a new technology, such a team will reduce the barrier to entry by addressing any likely concerns and minimizing the amount of work that developers have to do to make the switch. By going the extra mile to ease the lives of others, the team initiating the change improves the likelihood of success and greases the wheels of forward progress.

Preventing surprises

When rolling out a new technology, the goal is to lessen the likelihood of something unexpected happening. This involves predicting and documenting the things that are unavoidably apt to change as a consequence of the new technology. To those without context, any change will be unexpected, so the main thing to strive for is increasing the organization’s collective awareness of the change without being annoying.

Thorough documentation can go a long way, whether it be on a wiki, an email, or whatever communication mechanism the company relies on. A good way to frame migration documentation is in terms of the deficiencies of the old way and how the new hotness will improve the situation. “We’re hitting the safe upper limit of how far we can scale Database Product X within budget and our testing shows that Database Product Y will suit our projected needs for the next year and save us n dollars per month.”

Part of this documentation’s purpose is to walk developers through the process of migrating their projects to the new hotness. This will vary depending on the type of migration. For instance, a library change would call for introductory background information, before/after code samples, and links to any relevant API reference documentation. It’s important to mention any operational effects the changes may have. For instance, if the new APIs entail different resource utilization rates (e.g. object allocation, TCP connection churn) or behavioral changes, then the documentation should include specific metrics to keep an eye on when deploying the new code.

Conclusion

Coordinating changes within large software organizations is a necessary evil. There are serious downsides to doing too little or too much, so keeping a manageable number of people informed is a balancing game. Given the definitionally wide reach of “infrastructure”, bottom-up coordination is a key part of introducing new technologies within an organization.

Thanks to Ruben Oanta and Johan Oskarsson for reading and providing feedback on drafts of this post.

Survey on Technical Debt Management

2013-06-04T00:00:00-07:00

First coined by Ward Cunningham in 1992, the concept of “technical debt” is widely known within the software engineering community. It evokes other colloquialisms such as “code rot”, “cruft”, and “kludge”. The word “hack” is often used synonymously, but its usage is now overloaded and popularized to the point of meaninglessness. From his keynote presentation at the 2013 International Workshop on Managing Technical Debt, Steve McConnell (of Code Complete fame) provides a good working definition of technical debt:

A design or construction approach that's expedient in the short term but that creates a technical context in which the same work will cost more to do later than it would cost to do now.
Steve McConnell, Managing Technical Debt

A sizable portion of the work done by my team at Twitter qualifies as paying down technical debt. This is by no means meant as a negative. The performance gains from transitioning a Rails-based infrastructure into an ecosystem of JVM services have been gratifyingly enormous and the work itself is intellectually enriching. However, dealing with technical debt is generally considered less desirable than feature development.

This sentiment is totally understandable. Greenfield work is sexy and fits the trope of the lone hacker cranking out code, fueled by caffeine and the Social Network soundtrack. The harsh reality is that when you’re working on systems of any meaningful scale, building in isolation is rare. There will always be dependencies, requirements, or even simply code you wrote two weeks ago that gets in your way.

Technical debt is a natural part of the software development process, and is thus unavoidable. There exist software anti-patterns that produce predictable debt, as codified in Michael Duell’s Resign Patterns. Through awareness and internalization of sanitary development techniques, one can prevent certain classes of technical debt from occurring in the first place. But for the inevitable cases when it falls through the cracks, a manageable strategy is to be mindful of the debt as it accumulates and to periodically make a concerted effort to pay it down.

Mindfulness toward technical debt

Just as with financial debt, there are multiple classes of technical debt with varying levels of insidiousness. There is “high interest” debt that will waste countless future hours of work. An example of this would be an inconsiderate choice of framework, resulting in great expense to port to a different system later on. In contrast, an item of low interest debt could be putting off writing a class’s test suite until after a milestone. If paid down soon after being taken on, this type of debt can be acceptable. However, as low interest debt piles up, both in quantity and lifetime, it becomes increasingly dangerous and more onerous to deal with. If a development team is diligent about avoiding high interest debt and reducing low interest debt, they will be much more effective at reaching goals and staying productive in the long term.

Another axis on which to characterize debt is whether or not it’s taken on intentionally. Teams accrue intentional debt by making conscious decisions about the feasibility of their being able to handle the debt load later on. “We need to ship this feature ASAP, so let’s skip these tests until our next sprint.”

Unintentional debt is taken on carelessly, either by individuals’ actions or institutional change. On the level of an individual, a junior developer or contractor could introduce changes that render a system less maintainable. Depending on the complexity of the problem, code review is an effective preventative measure for these situations. Harder to deal with are large-scale events that inadvertently introduce vast tracts of debt. For example, the integration of an acquired company’s codebase or a coordinated refactor could leave a system in a less tenable state than it was before. There is no one-size-fits-all solution for such cases and they exemplify the importance of remaining mindful of debt accumulation.

In addition, it is important to track debt. With a log of specific debt items, a team can assess their debt load at any point and act accordingly. Without one, they are blindly flying into a minefield, condemned to endlessly fit square pegs into round holes. There is no way to reasonably fix the unmeasured quantity.

Planned payment of technical debt

Once a team locks down the rate at which they accumulate debt and makes a concerted effort to avoid the high-interest kind, paying down what remains is much more straightforward. From there, it’s simply a matter of prioritizing items in the debt log and chipping away at them.
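As a rough illustration of what a debt log and its prioritization could look like, here is a small sketch. The field names, the example items, and the ordering rule (high interest first, then oldest first) are my own assumptions for the example, not a prescribed format.

```javascript
// Hypothetical debt log; the fields and example items are illustrative assumptions.
const debtLog = [
  { item: "Missing tests for billing class", interest: "low", intentional: true, openedDaysAgo: 10 },
  { item: "Framework choice blocks upgrades", interest: "high", intentional: false, openedDaysAgo: 200 },
  { item: "Skipped tests for search feature", interest: "low", intentional: true, openedDaysAgo: 120 },
];

// Order high-interest items first, then oldest first, since low-interest
// debt becomes more onerous the longer it sits.
function prioritize(log) {
  const rank = { high: 0, low: 1 };
  // Copy before sorting so the log itself is left untouched.
  return [...log].sort(
    (a, b) => rank[a.interest] - rank[b.interest] || b.openedDaysAgo - a.openedDaysAgo
  );
}

const ordered = prioritize(debtLog).map((d) => d.item);
```

Even a log this simple makes the debt load visible: its length and the age of its oldest entries are numbers a team can watch over time.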

The application of positive habit formation tactics can be very effective here. Just as someone wanting to get in better shape can explicitly plan gym visits into their schedule, software development teams can plan debt-reduction periods into their release cycles. This can take many forms, depending on the temperament of the team:

Baking debt-repayment into the sprint cycle. (e.g. devoting a portion of each sprint or one entire sprint per month/quarter to tackling items on the debt log)
Having a debt-reduction rotation wherein individuals focus on debt during their duty cycle.
Spinning out debt-reduction into its own project with a separate pool of resources. I’m admittedly skeptical of this approach. It seems to be analogous to a garbage collection problem, in which a mutator (the development team) is continuously introducing work items to be fixed by a collector (the debt-reduction squad). This is theoretically feasible if debt introduction is kept at a reasonable rate, but the division seems unmanageable to me.

Conclusion

McConnell’s viewpoint is abstract and arguably too high level to be of much use for certain development teams. The strategy presented here meshes well with what I’ve experienced at Twitter, but I admittedly may be writing from a BigCo stance. It’s been pointed out that McConnell’s principles don’t necessarily suit the realities of smaller companies. It would be interesting to examine this statement in another post, focusing on debt accumulation and fallout as companies grow.

Technical debt is often preventable, but an inevitable part of the software development process. As much as it hurts one’s pride to hear it, everyone writes unthoughtful code some of the time. In order to keep systems maintainable, teams must adopt a strategic approach to controlling the rate at which debt accumulates, tracking the specific items that are deemed short-term-acceptable, and paying them down. Through this, a team can avoid much of the productivity and morale degradation associated with technical debt buildup.

If you find this topic interesting, I would encourage you to read through McConnell’s slides. My notes on the slides are available in this gist.

Thanks to Trevor Bramble, Mike Bernstein, and Richard Bailey for reading and providing feedback on drafts of this post.

TTLs for Dropbox

2011-10-31T00:00:00-07:00

A bunch of friends and I have a Dropbox shared folder in which we swap files of various (legal) sorts. Most of the folks in the group aren’t Dropbox zealots like myself who find ways to get 9+ GB for free. Thus the size of the directory in question becomes an issue as large forgotten files start to eat up others’ precious 2GB of space.

As a solution to this problem, I wrote a Node.js program that in essence lets you assign TTLs to items within a Dropbox directory. It runs as a daemon and deletes any files older than a specified lifetime.

For example, to run a daemon that checks the directory Dropbox/expirable-items once a day for items that are older than a week, modify the variable declarations thusly:

var dirToWatch = "expirable-items",
    ttl = 604800000, // 7 days (604800000 milliseconds)
    interval = 86400; // 24 hours (86400 seconds)
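The core of the expiry check can be sketched in a few lines. This is a simplified illustration rather than the program's actual code: the helper `expiredItems`, the shape of the item objects, and the example paths are assumptions, and the Dropbox API calls that list and delete files are left out.

```javascript
// Hypothetical helper, not taken from the program itself.
const DAY_MS = 24 * 60 * 60 * 1000; // one day in milliseconds

// Return the entries whose last-modified time is more than ttlMs before now.
function expiredItems(items, ttlMs, now) {
  return items.filter((item) => now - item.modified > ttlMs);
}

// Made-up directory listing, with timestamps relative to a fixed "now".
const now = Date.UTC(2011, 9, 31); // 2011-10-31 (months are zero-based)
const items = [
  { path: "expirable-items/old.zip", modified: now - 8 * DAY_MS },
  { path: "expirable-items/new.txt", modified: now - 1 * DAY_MS },
];

// With a 7-day TTL, only the 8-day-old file is selected for deletion.
const expired = expiredItems(items, 7 * DAY_MS, now);
```

Passing the current time in as an argument keeps the check easy to test without waiting a week.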

The program depends on the log.js and dropbox Node modules:

$ npm install log dropbox

Startup and delete events are logged to stdout, so redirect as you see fit:

$ node app.js > dropbox-ttl.log

Teach Scala to undergrads

2011-09-26T00:00:00-07:00

A symptom of Scala’s growing popularity is the incessant discussion of its place in the bevy of industrial programming languages. This debate is often confusing, as both advocates and detractors of the language at times use the same argument in their favor: that Scala’s complexity renders it unfit for use by the average developer. This talking point may generate votes on Hacker News, but it isn’t remarkably productive at improving the state of software development.

People have been demonizing the rise of JavaSchools for years and I believe Scala to be an effective countermeasure. It represents the perfect supplement to a programming languages course, with the ability to show students how powerful functional programming is when applied to “real world problems”. As a single example, seeing how one can use higher order functions to avoid manual iteration through collections is enough to at least show students how much easier life can be with Scala.
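To make that last point concrete, here is a small comparison. It is written in JavaScript rather than Scala, and the data and the "add five points to every passing score" rule are made up for the example, but Scala's collections offer `filter` and `map` in the same spirit.

```javascript
const scores = [72, 88, 95, 61];

// Manual iteration: the loop bookkeeping obscures the intent.
const passingLoop = [];
for (let i = 0; i < scores.length; i++) {
  if (scores[i] >= 70) {
    passingLoop.push(scores[i] + 5);
  }
}

// Higher-order functions: state what to keep and how to transform it.
const passing = scores.filter((s) => s >= 70).map((s) => s + 5);
```

Both versions compute the same result, but the second reads as a description of the result rather than a recipe for building it.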

I posit that the outlook of many students coming out of PL courses is akin to this continuum:

On one end you have “academic” languages like Haskell, ML, and Scheme, which are interesting but esoteric and impractical in that they’re rarely used in production environments due to their difficulty. On the other is the common currency of most software developers: Java and C (and Ruby and Python within more hip circles). The languages on the right are influenced by the research that culminates in the languages on the left in the same way that mainstream musical artists say that they listen to Thelonious Monk and Stravinsky to get ideas.

Scala fits somewhere in the middle. It’s a reasonably approachable language with a rapidly growing community and ample room for neckbearding. As proven by Foursquare, Tumblr, Twitter, Yammer, etc., Scala is a remarkable language for building the kinds of systems that CS students swoon over. After teaching ML, Haskell, or Scheme (WLOG), one could use Scala to show that many of the most expressive features of functional programming can be harnessed for use in a JVM language. Helping students connect the dots between imperative and functional programming would teach a valuable lesson that many of them don’t fully grasp.

More emphasis should be placed on experimenting with ways of raising the bar of the “average developer”. While I agree with the sentiments behind the notion that Scala is “too hard for a large portion of the Java community”, this comes off as more of a statement about Java developers than about Scala. If Scala is going to be pigeonholed into strictly being for a higher class of programmer, then why not enlighten students in their formative years?

Note: This argument could just as easily be made in favor of Clojure. The point is to experiment with improving the state of average instead of saying things are too hard.

Two months in

2011-09-05T00:00:00-07:00

Like countless others on the internet, I’ve been “meaning to write more” for a long time. Under the assumption that Wordpress puts too much process into the task of blogging, I’ve designed a new personal website using a simpler tool. Hopefully the ability to write essays using the same workflow that I use to write code will grease the wheels of expression.

My old Wordpress site is now accessible at old.evanmeagher.net. The new site is hosted on GitHub and its source is available here.

Last Friday was the two month mark of my employment at Twitter, Inc. I don’t think that I could be happier with my current situation. Twitter is proving to be exactly the workplace that I was hoping for: a friendly and open atmosphere with brilliant coworkers more than willing to help me learn everything that I can as quickly as possible. Coming out of college, it’s exactly the kind of environment that I want to be in to further my technical education.

As for the contents of this blog, I intend to write about what I learn. At the moment, this would include things about Scala, functional programming, and distributed systems, but my interests are bound to ebb and flow as I work on different projects and interact with different people.

To keep up to date with me, you can subscribe to this blog or follow me on Twitter for more granular updates.

Graduation

2011-06-27T00:00:00-07:00

It’s been a little over two weeks since I graduated from college. Tomorrow I’ll pack my life into a truck and begin the 800-mile journey from Seattle to San Francisco.

This move has been the light at the end of my tunnel for the past six months. In December, I turned down a job at Google Seattle in favor of one at Twitter, to the bewilderment of many of my friends and family. With a new city and an exciting job looming on the horizon, I’ve spent the first half of 2011 finishing my last two quarters of school and mentally preparing myself for a head-first dive into Silicon Valley.

As I begin a new chapter of my life, it seems like as good a time as any to take a crack at my lofty, neglected goal of writing more. Thus, I’ve created this blog on which to write about things that interest me. Stay tuned to see if I follow through.
