Observability for product teams: what, why, who, and how
Observability is not a new concept in the software industry, but it still amazes me how many different interpretations of the term I hear. In many cases, observability ownership appears ambiguous, and it's also common to see it treated as an afterthought.
Last updated 12 Sep 2023
Have you ever released a new experiment or feature into the wild only to discover that the team cannot determine its success? What if different people have different ideas about what success implies?
In this post, I want to share my perspective on what observability brings to product teams and how it can become an automatic way of thinking.
What does observability mean?
Observability was introduced in the 1960s by R. E. Kalman as the ability to understand the internal state of a system by observing its external outputs. It is a fundamental concept in control theory, systems theory, and computer science, particularly in the context of distributed systems and microservices.
Since then, we can agree that the software industry has evolved. Today, team compositions represent the system's ideal future state (Conway's Law and the Cognitive Load Theory). We've seen an increase in modularity, reflected in smaller product teams owning scoped software modules and features with more explicit boundaries. Observing the system's internal state is no longer sufficient to determine success.
For example, we can have an endpoint feature that operates at 99.999% uptime, but does that mean it's successful? It depends on what we decided to optimize for and how well the feature is received by our users. Let's simplify this engineering-heavy term: the word ‘observability’ comes from the verb ‘observe’, meaning ‘to take notice of something’ and regard it with attention.
So, let’s start with a more straightforward fact: product teams need to be able to observe their product!
Observability vs. monitoring
Another misconception I often see is telemetry, or monitoring, being confused with observability. Monitoring and alerting are tools that work on top of observable properties. I like to describe observability using a definition my colleague, Senior Director of Engineering Vasco Pinho, shared with me: “Observability should be treated as a required property of the system that enables us to measure success, learn, and decide where to invest.”
On top of observable properties, we can build monitors, SLOs, HEART metric graphs, and more, but we need the data to be in place first. Observable properties are proactively implemented to enable monitoring to be built on top.
Here are two examples to clarify the difference between observable properties and the tooling used on top:
Outside-of-work example: a person goes out for a jog and keeps checking their smartwatch to regulate their pace based on their heart rate (bpm)
Observable properties: body data points, heartbeat, the actual pace
Tools used: smartwatch monitors and alarms for high heart rate
Software example: a team releases a new API endpoint, and they set up monitors to know when something is not going well
Observable properties: metrics, logs, and traces
Tools used: monitors, SLOs, graphs, and alerts
This data will reside in multiple systems because each system focuses on a specific area. One does not replace the other, and we must ensure we’re using the right tools for the right job, too.
Datadog: allows you to determine performance metrics as well as event monitoring for infrastructure and cloud services
Mixpanel: enables you to capture data on how users interact with your digital product
Product experience: product experience insights tools like Hotjar give you behavior analytics and feedback data to help you empathize with and understand your customers
All sources matter and are equally important because each one explains a different aspect, and the end-user experience of a feature is the sum of all those parts.
It is essential to distinguish between the observable properties and the tooling used on top of them.
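The split between observable properties and the tooling on top can be sketched in a few lines of Python. This is a minimal illustration, not how any particular platform works: `RequestRecord`, `error_rate`, `error_rate_monitor`, and the 5% threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestRecord:
    """One observable data point emitted by a (hypothetical) API endpoint."""
    status_code: int
    duration_ms: float


def error_rate(records: List[RequestRecord]) -> float:
    """Share of requests that ended in a 5xx response."""
    if not records:
        return 0.0
    errors = sum(1 for r in records if r.status_code >= 500)
    return errors / len(records)


def error_rate_monitor(records: List[RequestRecord], threshold: float = 0.05) -> bool:
    """A tool built on top of the data: True means the alert should fire."""
    return error_rate(records) > threshold


# The observable properties: raw data the endpoint has to emit first.
records = [
    RequestRecord(200, 120.0),
    RequestRecord(500, 950.0),
    RequestRecord(200, 80.0),
    RequestRecord(200, 95.0),
]

# The tooling: a monitor that can only exist because the data is there.
alert_should_fire = error_rate_monitor(records)
```

The monitor is a thin function over the records. Without the records being emitted in the first place, there is nothing for it to evaluate.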
Why does it matter?
Reason 1: HEART metrics dependency
The first reason revolves around Google’s HEART metrics relying on the foundation of reliability. Why? Because reliability is what leads users to trust our product. Here is how I think of trust:
“Trust = reliability = consistent behavior over time”
This equation means that trust depends on reliability, and reliability is reflected through providing a consistent user experience that meets our users' expectations over time.
Each time a user uses the system and it doesn’t behave as expected, it leads to confusion, frustration, doubt, and a dent in trust. By consistency here I don’t just mean UX consistency across different features of our product. I also mean a consistent ability to access our product (no downtime) and consistent user operation performance (requests not taking too long). To achieve consistency, we need to proactively learn about and respond to different life events happening around us all the time. And this is what an observability mindset enables.
Reason 2: delivering the correct value faster
The second reason is an increased ability to deliver value to our users faster by investing in the right place at the right time. When building a minimum viable product (MVP), it's common to cut corners to release it quickly and start collecting user feedback, but I want to break this down a little more. Let’s start with the definition of MVP:
A minimum viable product (MVP) is the first version of a product fit for market. An MVP has core functionality and, coupled with customer feedback, is a learning tool for product teams to release new features and better iterations of the product.
Teams focusing on releasing an MVP try their best to avoid lengthy and (ultimately) ‘unnecessary work’, but how do we classify unnecessary work? Does observability fall into this bucket? Let’s dig deeper and pay some attention to the V in MVP: viable. Viable means capable of working successfully; feasible.
How do we make sure our feature is viable? Through proactively planned observability. Releasing a feature into the wild and expecting customer feedback via 'contact us' forms will get us some feedback—but only from that small subset of loyal users willing to take the additional steps to get the input through.
With observable data in place and the tooling to help visualize it, teams can learn much more in much less time. All this leads me to conclude that observability is not one of those corners you want to cut during an MVP.
Now, I’ll use a fictional example explaining how speed depends on good observability. Imagine driving a car much faster than you’re used to, like a Lamborghini Huracan with a V10 engine, 0-100 km/h in about three seconds. With that much speed, the driver needs to have clear visibility of what is coming up ahead, right? For product squads, trying to release faster with little (or no) observability is like driving a Huracan with a blindfold on (please don’t try this at home if you own one).
Reason 3: alignment
My third and final reason is alignment. By proactively thinking of what data we need to observe, we also define what success looks like. Observability provides a common language and shared understanding of the product and its performance. Through observability, the team can access the same real-time data and insights, allowing them to make data-driven decisions and stay aligned with the product's goals and objectives. This alignment helps to ensure that everyone is working towards the same vision and objectives, reducing misunderstandings and improving overall collaboration and teamwork.
In summary, observability promotes alignment by providing a shared understanding of the product, enabling the team to make data-driven decisions, and fostering collaboration and teamwork towards a common goal. Regarding MVPs and observability, it’s essential to strike a good balance and identify the minimum observable properties needed. With this agreement documented, we leave less room for interpretation.
Who is the owner?
Observable data comes from different sources and is no longer a topic solely driven by engineering. It’s something that all angles of a product need to reflect upon.
In a nutshell, the owning product squad owns observability. Let’s look at a scenario to help explain why.
🔎 Observability for product teams: a case study
A product team finished the last commit of a feature, the reviewer approved the Merge Request, and the CI pipeline pushed this to production. We switch on the feature flag, and our users experience the new feature. After a few hours, we receive feedback from users that they cannot perform fundamental operations, which escalates into an incident.
The team that worked on the feature gets pulled in to assess the impact. After further analysis, the problem is identified, and a fix is deployed.
For this example, let’s assume it took four hours to detect the incident and another four hours to get the fix live, with the entire squad looking into it. This means we have eight hours of user impact, but that’s not all. The team that worked on the feature returns to their sprint board having lost half a day, with everyone focused on the unplanned incident.
As a repercussion, some tasks must drop out of the sprint, and the team will not meet their goal. The disruption impacted the team owning the feature mainly because the issue was discovered reactively.
As we can see, the feature-owning team is the most disrupted in this scenario. The scenario makes it more apparent that product teams are responsible for observability, not only because of the disruptions to their planned work but also because they want to see their features succeed and their users happy. As explained earlier, observability covers different aspects, and all team members can contribute. The easier it is for the team to proactively learn about their new feature’s success, the smaller the impact on their users and the fewer reactive disruptions to their weekly goals.
However, it's important to note that observability is a cross-functional responsibility. Everyone in the team should know the importance of observability and participate in building and maintaining its infrastructure.
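The case study's arithmetic can be made explicit. The sketch below uses the four-hour detection and four-hour fix times from the scenario. The 15-minute detection time for the proactive case is purely an assumption for comparison, and `user_impact_hours` is a hypothetical helper.

```python
def user_impact_hours(time_to_detect: float, time_to_fix: float) -> float:
    """Users are affected from the moment of release until the fix is live."""
    return time_to_detect + time_to_fix


# Reactive: numbers taken from the case study above.
reactive_impact = user_impact_hours(time_to_detect=4, time_to_fix=4)

# Proactive: ASSUMED 15-minute detection thanks to monitors on observable data;
# the fix time is left unchanged to isolate the effect of faster detection.
proactive_impact = user_impact_hours(time_to_detect=0.25, time_to_fix=4)

hours_saved = reactive_impact - proactive_impact
```

Under these assumptions, the only thing that changes is how soon the team learns about the problem, and that alone removes almost half of the user impact.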
How should you go about it?
Teams must agree that observability is not optional, and for every feature, it should be a required part of the definition of done. When looking at a new user story, it can only be considered ready for sizing if the definition of success is clear. With this in place, the team can assess the observable properties needed during implementation, what quantitative feedback we want to collect as part of the user experience journey, and what additional tooling we need to monitor this data.
Every time we design a new feature or a change to an existing one, we must think of the observable properties, user behaviors, or system behaviors we must observe to measure the product's success.
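One way to picture this rule is as a small check on the user story itself. The following is a sketch only: `UserStory`, its fields, and the example story are hypothetical and not tied to any real backlog tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UserStory:
    """A hypothetical user story carrying its own observability requirements."""
    title: str
    success_definition: str = ""
    observable_properties: List[str] = field(default_factory=list)

    def ready_for_sizing(self) -> bool:
        """A story can be sized only once its definition of success is written down."""
        return bool(self.success_definition.strip())

    def meets_definition_of_done(self, implemented: List[str]) -> bool:
        """Done only if every agreed observable property has been implemented."""
        if not self.ready_for_sizing() or not self.observable_properties:
            return False
        return set(self.observable_properties) <= set(implemented)


# Illustrative values only; the metric and property names are made up.
story = UserStory(
    title="Export report as CSV",
    success_definition="Users complete an export without contacting support",
    observable_properties=["export_requested event", "export duration metric"],
)
```

Here `story.ready_for_sizing()` is true because a success definition exists, while `story.meets_definition_of_done(["export_requested event"])` is false until the duration metric is implemented as well.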
From an engineering perspective, one must consider the following:
Logging and log levels: using logging to record events, errors, and debugging information to make it easier to query and analyze
Instrumentation: adding instrumentation to the code to capture performance metrics, such as response times, error rates, and throughput
Distributed tracing: using distributed tracing to track requests as they move through different systems to help identify bottlenecks and debug complex problems
Monitoring platform: the platform will help us monitor this data for anomalies and failures and allow us to set up alerts and notifications
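The first three items can be sketched with nothing but the Python standard library. This is a simplified, single-process stand-in, under several assumptions: the in-memory `metrics` dictionary takes the place of a real metrics backend, the `trace_id` only correlates log lines within one process rather than across services, and `create_order`, `handle_request`, and `instrumented` are hypothetical names.

```python
import contextvars
import logging
import time
import uuid
from collections import defaultdict
from functools import wraps

# Trace id shared by every log line emitted while handling one request.
trace_id_var = contextvars.ContextVar("trace_id", default="-")

# In-memory metric store; a real setup would ship these to a metrics backend.
metrics = defaultdict(list)


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to each log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def instrumented(name):
    """Record the duration and outcome of each call under the given metric name."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                metrics[f"{name}.success"].append(1)
                return result
            except Exception:
                metrics[f"{name}.error"].append(1)
                logger.exception("%s failed", name)  # ERROR level, with traceback
                raise
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                metrics[f"{name}.duration_ms"].append(elapsed_ms)
        return wrapper
    return decorator


@instrumented("create_order")
def create_order(items):
    if not items:
        raise ValueError("order has no items")
    logger.info("order created with %d items", len(items))
    return {"items": items}


def handle_request(items):
    """Entry point: start a trace, then call the instrumented operation."""
    token = trace_id_var.set(uuid.uuid4().hex)
    try:
        return create_order(items)
    finally:
        trace_id_var.reset(token)
```

Calling `handle_request(["book"])` logs an INFO line tagged with a fresh trace id and records one success plus one duration sample. Calling it with an empty list logs the error, records it, and re-raises the exception. A monitoring platform would then alert on the resulting error counts and durations.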
In our MVP and initial releases, we don’t need all the fancy SLOs, monitors, and graphs. We need the foundational data that will help us determine success from failure and understand early system performance so that we can learn from it. With a limited set of observable properties and the necessary tooling, we can determine where to iterate and what additional tools are needed.
In summary, to avoid treating observability as a speed bump, we need to plan ahead for it.
Adopting a new mindset
Observability is not just metrics and data. It’s a mindset change product teams need so they can drive faster and invest in the right space. In summary, you need to answer the following questions to get there:
What is it? Observability is a property of the system that enables us to measure success, learn, and decide where to invest.
Why does it matter?
HEART metrics depend on trust: trust depends on reliability, which is reflected by consistent behavior over time.
Observability is a requirement for speed: to go faster, we need more visibility. It also allows us to invest in the right place at the right time.
Observability creates alignment by defining what success should look like.
Who is the owner? Observability is owned by product squads, not just one member, because of their commitment to their users and because they are the ones disrupted by reactive escalations the most.
How should you go about it? Observability needs to be a required part of the definition of done for a user story.
I hope you found this blog post informative and enjoyable. I’m always looking for ways to improve, so if you have any thoughts or feedback on this post or other ideas on how observability should be (or is) treated within your team or company, please don't hesitate to share them via the Hotjar Feedback widget—it’s that red button to the right of your screen.