Your team is supposed to use an Agile approach, such as Scrum. But you have a years-long backlog, your standups are individual status reports, and you’re still multitasking. You and your team members wish you had the chance to do great work, but this feels a lot like an “agile” death march. There’s a reason you feel that way. You’re using fake agility—a waterfall lifecycle masquerading as an agile approach. Worse, fake agility is the norm in our industry.

Now, there is light at the end of the tunnel; let’s delve into Tackling Fake Agility with Johanna Rothman! Watch the video now: “Agile” Does Not Work for You? Tackling Fake Agility with Johanna Rothman at the 59th Hands-on Agile Meetup.

Abstract: Tackling Fake Agility

Your team is supposed to use an Agile approach, such as Scrum. But you have a years-long backlog, your standups are individual status reports, and you’re still multitasking. You and your team members wish you had the chance to do great work, but this feels a lot like an “agile” death march. There’s a reason you feel that way. You’re using fake agility—a waterfall lifecycle masquerading as an Agile approach. Worse, fake agility is the norm in our industry.

No one has to work that way. Instead, you can assess your culture, project, and product risks to select a different approach. That will allow you to choose how to collaborate so you can iterate over features and when to deliver value. When you do, you are more likely to discover actual agility and an easier way to work.

The learning objectives of Johanna’s session on Tackling Fake Agility were:

- Have a clear understanding of the different lifecycles and when to use each.
- Be able to assess your project, product, and portfolio risks.
- Know how to customize a lifecycle based on the unique culture and requirements of the team.
- Know how to create shorter feedback loops in any lifecycle for product success.

Questions and Answers

During the Q&A part on Tackling Fake Agility, Johanna answered the following questions, among others:

- How do we model risk? Possible approaches?
- How do we measure risk? Possible approaches?
- How do we model value? Possible approaches?
- How do we measure value? Possible approaches?
- No matter how we try to have teams work vertically, we get teams saying that they need a cohesive team or microservice team as they need to build things, and the others will build on top of them. What do you think?
- How can the organization measure the benefit of agility?
- In some software development teams, it seems natural to have the design and mock-up ready before the development, before the sprint planning, and QA done after, sometimes in the next sprint, and it seems to work for them better than doing all in the same Sprint. Why do Architecture and Requirements work in dedicated time ranges ahead of increments? Does that hold for other business analysis activities like a risk analysis?
- Based on your experience, what must we do to be valuable Agile coaches or consultants?
- Are there any cases in which using the cost of delay does not work, or would you not use it?

Watch the recording of Johanna Rothman’s Tackling Fake Agility session now.

Meet Johanna Rothman

“People know me as the ‘Pragmatic Manager.’ I offer frank advice—often with a little humor—for your tough problems. I help leaders and managers see and do reasonable things that work. Equipped with that knowledge, you can decide how to adapt your product development, always focusing on the business outcomes you need. My philosophy is that people want to do a good job. They don’t always know what they are supposed to do, nor how to do it.”

Connect with Johanna Rothman: Johanna’s Blog | Johanna Rothman on LinkedIn
Product Ownership Is a Crucial Element in Improving Outcomes

SAFe and Scrum both consider product ownership crucial to maximizing outcomes in environments of complexity and uncertainty. Teams are ideally organized around products/value streams so they can apply customer-centricity. Product people and their teams are ideally accountable for outcomes and are empowered to figure out, inspect, and adapt requirements/scope as needed.

SAFe Has Multiple Product Ownership/Management Layers

As organizations tackle bigger products, they have some alternatives for how to tackle product ownership/management. Scrum advises having one product owner for each product, even if multiple teams develop the product. This is at the core of scaling frameworks such as Nexus and LeSS. SAFe takes a path that is more aligned with the classic structure of product management organizations, which is to have multiple layers of product ownership/management:

- Product owners own the product at the Agile Team level.
- Product managers own the product at the team-of-Agile-teams level (Agile Release Trains).
- Solution managers own products for huge teams of teams working on even larger products/solutions.

Why Did SAFe Make This Choice?

SAFe's perspective is to learn from experience in the trenches: observe which patterns organizations are actually using and apply lean/Agile principles as needed to help those organizations evolve. And many organizations have been struggling to scale product ownership when multiple teams are involved. Product management experts such as Melissa Perri also talk about multiple product management roles (see some thoughts about how this relates to SAFe below). Interestingly enough, Scrum@Scale also has product owners at every level. And LeSS/Nexus also introduce multiple product owners when you scale beyond a handful of teams.

The advantage of this approach is that it aligns with the product manager/owner journey. Working closely with one or two teams, owning product choices for a couple of product features or a certain slice of the product, can be a great jumping-off point for junior product managers/owners (what Melissa Perri refers to as associate product managers in Escaping the Build Trap). As the product manager/owner gains experience, they can take on a whole product themselves. It takes time for a product owner/manager to gain the experience to act as the visionary entrepreneur for their product. They might start out feeling more comfortable writing stories and executing experiments and, over time, learn to influence, design product experiments, and make tougher prioritization decisions with multiple demanding stakeholders. In other words, product managers/owners naturally evolve from focusing on tactics to strategy over time.

What Are Some Downsides to Splitting Product Responsibilities Between the Product Owner and Product Manager?

An anti-pattern we often see is that the PM/PO split allows an organization to staff the PO role with “story writers” and “project managers” — people who aren’t empowered as product owners and who reinforce the project mindset of requirement order-taking and managing scope-budget-timeline. This lack of empowerment leads to delays and an environment where the team is focused on outputs rather than outcomes. Empowering product owners and their teams is a common challenge in SAFe AND Scrum.
What I’ve seen work well is carving out an appropriate product scope within which the product owner and team are empowered to figure out what to build to achieve the desired outcomes and optimize the value of that product or that aspect of a bigger product. Doing this requires figuring out the product architecture and moving towards an empowering leadership style. As in many other areas, SAFe takes the evolutionary approach. If you’re a purist or a revolutionary, you’ll probably struggle with it. Real-world practitioners are more likely to relate to the evolutionary approach. It’s important to ensure that the PO/PM separation is not seen as an excuse to continue doing everything the same way.

Product Managers and Product Owners: A Collaborative Relationship

Leaders implementing the PO/PM split should ensure healthy collaboration, involvement, and partnership across the product ownership/management team. Product managers should internalize the SAFe principle of unleashing the intrinsic motivation of knowledge workers, in this case, product owners. Product managers have a role as lean/Agile leaders to nurture the competence, awareness, and alignment in the product team that would enable them to decentralize control and let product owners OWN a certain slice of the product. Product managers and product owners should discuss which decisions make sense to centralize and which should be decentralized. The goal of product managers should be to grow product owners over time so they can make more and more decisions — and minimize the decisions that need to be made centrally. This is key to scaling without slowing down decision-making while maintaining, and ideally improving, outcomes aligned with strategic goals.

Increased Risk of Misunderstandings Around Product Ownership With Product Roles Filled by Non-Product People

One very specific risk of the SAFe choice to split the PM and PO roles is that it creates the need for many more people in a product role than the product organization actually has. This vacuum pulls people like business analysts, project managers, and development managers into the product owner role. Some of these people can become great product owners but come with very little product (management) experience. Business analysts, for example, are used to treating what customers say as requirements. They are used to the “waiter” mindset. They struggle to say no or to think strategically about where the product should go and what should be in it. Development managers are used to being subject matter experts, guiding their people at the solution level, and managing the work. Project managers are used to focusing on managing scope/budget/timeline rather than value and outcomes.

Use the Professional Scrum Product Ownership Stances to Improve Your SAFe Product Ownership

One technique I found very useful is to review the Professional Scrum Product Ownership Stances with SAFe product owners and product managers. We try to identify which misunderstood stances we’re exhibiting and what structures are reinforcing these misunderstood stances/behaviors. For example — what’s causing us to be “story writers”? We explore the preferred product owner stances and discuss what’s holding us back from being in these stances. Why is it so hard to be an “experimenter,” for example? An emerging realization from these conversations is that SAFe structurally creates a setup where team-level product owners play “story writers” and “subject matter experts” more often.
It’s non-trivial to switch to an environment where they are a “customer representative” and a “collaborator” with the space to “experiment” with their team towards the right outcome rather than take requirements as a given. It’s also hard to get SAFe product managers to be the “visionary,” “experimenter,” and “influencer.” The issue here isn’t unique to SAFe. Product owners struggle to exhibit these behaviors in most Scrum environments as well. What I find useful is to align on a “North Star” vision of what we WANT product ownership to look like at scale and take small steps in that direction continuously, rather than settle for “project management” or “business analysis” called by a new name.

SAFe Product Management: Providing Vision and Cohesion in a Pharma IT Environment

Let’s close with an example of how this can play out in practice. I'm working with the IT organization of a pharmaceutical company. As they were thinking about how to help their Enterprise Applications group become more agile, one of the key questions was how to create product teams that are empowered to directly support the business — by minimizing dependencies and creating real ownership of each of the enterprise applications as a platform that other IT functions can more easily build on and leverage. Several Enterprise Applications have multiple teams working on different aspects of them. We created stream-aligned teams, each owning and managing one such aspect as a product. The product owners and their teams are empowered to consider needs and wants coming in from other IT functions and the business and to shape the future of their product. Most of these product ownership decisions happen at the team level. Product managers focus on alignment and cohesion across the platform. We are still working on establishing the right mechanisms to balance vision/alignment with local initiatives at the team level.

So, Now What?

SAFe’s approach to product ownership is a frequent target of criticism in the hard-core Agile community. Some of it is pure business/commercial posturing (aka FUD), and some of it is fair and constructive. My aim in this article was to help practitioners explore the rationale, the potential, and the risks behind SAFe’s approach to product ownership, as well as some patterns and models, such as the Professional Scrum Product Ownership stances, that can be used to inspect and adapt/grow the effectiveness of your product ownership approach. As an individual product owner or product manager, you can use these models/patterns to drive your learning journey and help structure your organization's conversation around creating the environment that empowers you to be a real product owner or product manager. As a leader of a product organization in a SAFe environment, I hope this can help you establish a vision of what you want your product organization to look like and guide you on the way there.
Imagine entering a bustling workshop - not of whirring machines, but of minds collaborating. This is the true essence of software programming at its core: a collective effort where code serves not just as instructions for machines, but as a shared language among developers. However, unlike spoken languages, code can often become an obscure dialect, shrouded in complexity and inaccessible to newcomers. This is where the art of writing code for humans comes into play, transforming cryptic scripts into narratives that others can easily understand. After all, a primary group of users for our code is software engineers: those who are currently working with us or will work on our code in the future.

This creates a shift in our software development mindset. Writing code just for the machines to understand and execute is not enough; it's necessary but not sufficient. If our code is easily human-readable and understandable, we've taken a substantial step towards manageable code complexity. This article focuses on how human-centric code can help towards manageable code complexity. There are a number of best practices, but they should be applied with careful thinking and consideration of our context. Finally, the jungle metaphor is used to explain some basic dynamics of code complexity.

The Labyrinth of Complexity

What is the nemesis of all human-readable code? Complexity. As projects evolve, features multiply, and lines of code snake across the screen, understanding becomes a daunting task. To combat this, developers wield a set of time-tested principles, their weapons against chaos. It is important to keep in mind that complexity is inevitable. It may be minimal or it may be high, but the key takeaway is that while complexity creeps in, it doesn't have to conquer our code. We must be vigilant and act early so that we can write code that keeps growing and not groaning.

Slowing Down

By applying good practices like modular design, clear naming conventions, proper documentation, and principles like those mentioned in the next section, we can significantly mitigate the rate at which complexity increases. This makes code easier to understand, maintain, and modify, even as it grows.

Breaking Down Complexity

We can use techniques like refactoring and code reviews to identify and eliminate unnecessary complexity within existing codebases. This doesn't eliminate all complexity, but it can reduce it significantly.

Choosing Better Tools and Approaches

Newer programming languages and paradigms often focus on reducing complexity by design. For example, functional programming promotes immutability and modularity, which can lead to less intricate code structures.

Complete Elimination of Complexity

Slowing the growth of complexity is one thing, reducing it is another, and completely eliminating it is something else entirely: it is rarely achievable in practice.

Time-Tested Principles

Below is a sample of principles that may help in our battle against complexity. It is by no means an exhaustive list, but it helps to make our point that context is king. While these principles offer valuable guidance, rigid adherence can sometimes backfire. Always consider the specific context of your project. Over-applying principles like Single Responsibility or Interface Segregation can lead to a bloated codebase that obscures core functionality.

Don't Make Me Think

Strive for code that reads naturally and requires minimal mental effort to grasp.
Use clear logic and self-explanatory structures over overly convoluted designs. Make understanding the code as easy as possible for both yourself and others.

Encapsulation

Group related data and functionalities within classes or modules to promote data hiding and better organization.

Loose Coupling

Minimize dependencies between different parts of your codebase, making it easier to modify and test individual components.

Separation of Concerns

Divide your code into distinct layers (e.g., presentation, business logic, data access) for better maintainability and reusability.

Readability

Use meaningful names, consistent formatting, and comments to explain the "why" behind the code.

Design Patterns (Wisely)

Understand and apply these common solutions, but avoid forcing their use.

For example, the SOLID principles can be summarized as follows:

Single Responsibility Principle (SRP)

Imagine a Swiss Army knife with a million tools. While cool, it's impractical. Similarly, code should focus on one well-defined task per class. This makes it easier to understand, maintain, and avoid unintended consequences when modifying the code.

Open/Closed Principle (OCP)

Think of LEGO bricks. You can build countless things without changing the individual bricks themselves. In software, OCP encourages adding new functionality through extensions, leaving the core code untouched. This keeps the code stable and adaptable.

Liskov Substitution Principle (LSP)

Imagine sending your friend to replace you at work. They might do things slightly differently, but they should fulfill the same role seamlessly. The LSP ensures that subtypes (inheritances) can seamlessly replace their base types without causing errors or unexpected behavior.

Interface Segregation Principle (ISP)

Imagine a remote with all buttons crammed together. Confusing, right? The ISP advocates for creating smaller, specialized interfaces instead of one giant one. This makes code clearer and easier to use, as different parts only interact with the functionalities they need.

Dependency Inversion Principle (DIP)

Picture relying on specific tools for every task. Impractical! DIP suggests depending on abstractions (interfaces) instead of concrete implementations. This allows you to easily swap out implementations without affecting the rest of the code, promoting flexibility and testability.

Refactoring

Regularly revisit and improve the codebase to enhance clarity and efficiency.

Simplicity (KISS)

Prioritize clear design, avoiding unnecessary features and over-engineering.

DRY (Don't Repeat Yourself)

Eliminate code duplication by using functions, classes, and modules.

Documentation

Write clear explanations for both code and software usage, aiding users and future developers.

How Misuse Can Backfire

While the listed principles aim for clarity and simplicity, their misapplication can lead to the opposite effect. Here are some examples.

1. Overdoing SOLID

Strict SRP: Imagine splitting a class with several well-defined responsibilities into multiple smaller classes, each handling a single, minuscule task. This can create unnecessary complexity with numerous classes and dependencies, hindering understanding.

Obsessive OCP: Implementing interfaces for every potential future extension, even for unlikely scenarios, may bloat the codebase with unused abstractions and complicate understanding the actual functionality.
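To make this concrete, here is a minimal, hypothetical Python sketch (the class names and the reporting scenario are invented for illustration, not taken from the article): the first version applies SRP and OCP aggressively to a simple reporting feature, while the second keeps one cohesive class that is still easy to read, test, and change.

```python
from abc import ABC, abstractmethod

# Over-engineered: speculative abstractions for a task with one obvious implementation.
class ReportDataFetcher(ABC):
    @abstractmethod
    def fetch(self) -> list[dict]: ...

class ReportFormatter(ABC):
    @abstractmethod
    def format(self, rows: list[dict]) -> str: ...

class InMemoryReportDataFetcher(ReportDataFetcher):
    def __init__(self, rows: list[dict]):
        self._rows = rows

    def fetch(self) -> list[dict]:
        return self._rows

class CsvReportFormatter(ReportFormatter):
    def format(self, rows: list[dict]) -> str:
        header = ",".join(rows[0].keys())
        lines = [",".join(str(v) for v in row.values()) for row in rows]
        return "\n".join([header, *lines])

class ReportService:
    def __init__(self, fetcher: ReportDataFetcher, formatter: ReportFormatter):
        self._fetcher = fetcher
        self._formatter = formatter

    def build(self) -> str:
        return self._formatter.format(self._fetcher.fetch())

# Pragmatic: one cohesive class, still simple to understand, test, and change later.
class SalesReport:
    def __init__(self, rows: list[dict]):
        self._rows = rows

    def to_csv(self) -> str:
        header = ",".join(self._rows[0].keys())
        lines = [",".join(str(v) for v in row.values()) for row in self._rows]
        return "\n".join([header, *lines])

if __name__ == "__main__":
    rows = [{"region": "EU", "total": 42}, {"region": "US", "total": 57}]
    print(ReportService(InMemoryReportDataFetcher(rows), CsvReportFormatter()).build())
    print(SalesReport(rows).to_csv())  # same output, far less ceremony
```

The speculative abstractions in the first version only pay off if a second data source or output format is genuinely expected; until then, the single class communicates the intent more directly.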
2. Misusing Design Patterns

Forced Factory Pattern: Applying a factory pattern where simply creating objects directly would make sense can introduce unnecessary complexity and abstraction, especially in simpler projects.

Overkill Singleton: Using a singleton pattern for every service or utility class, even when unnecessary, can create global state management issues and tightly coupled code.

3. Excessive Refactoring

Refactoring Mania: Constantly refactoring without a clear goal or justification can introduce churn, making the codebase unstable and harder to follow for other developers.

Premature Optimization: Optimizing code for potential future performance bottlenecks prematurely can create complex solutions that may never be needed, adding unnecessary overhead and reducing readability.

4. Misunderstood Encapsulation

Data Fortress: Overly restrictive encapsulation, hiding all internal data and methods behind complex accessors, can hinder understanding and make code harder to test and modify.

5. Ignoring Context

Blindly Applying Principles: Rigidly adhering to principles without considering the project's specific needs can lead to solutions that are overly complex and cumbersome for the given context.

Remember

The goal is to use these principles as guidelines, not strict rules. Simplicity and clarity are paramount, even if it means deviating from a principle in specific situations. Context is king: adapt your approach based on the project's unique needs and complexity. By understanding these potential pitfalls and applying the principles judiciously, you can use them to write code that is both clear and efficient, avoiding the trap of over-engineering.

The Importance of Human-Centric Code

Regardless of the primary user, writing clear, understandable code benefits everyone involved, from faster collaboration and knowledge sharing to reduced maintenance and improved software quality.

1. Faster Collaboration and Knowledge Sharing

- Onboarding becomes a breeze: New developers can quickly grasp the code's structure and intent, reducing the time they spend deciphering cryptic logic.
- Knowledge flows freely: Clear code fosters open communication and collaboration within teams. Developers can easily share ideas, understand each other's contributions, and build upon previous work.
- Collective intelligence flourishes: When everyone understands the codebase, diverse perspectives and solutions can emerge, leading to more innovative and robust software.

2. Reduced Future Maintenance Costs

- Bug fixes become adventures, not nightmares: Debugging is significantly faster when the code is well-structured and easy to navigate. Developers can pinpoint issues quicker, reducing the time and resources spent on troubleshooting.
- Updates are a breeze, not a burden: Adding new features or modifying existing functionality becomes less daunting when the codebase is clear and understandable. This translates to lower maintenance costs and faster development cycles.
- Technical debt stays in check: Clear code makes it easier to refactor and improve the codebase over time, preventing technical debt from accumulating and hindering future progress.

3. Improved Overall Software Quality

- Fewer bugs, more smiles: Clear and well-structured code is less prone to errors, leading to more stable and reliable software.
- Sustainable projects, not ticking time bombs: Readable code is easier to maintain and evolve, ensuring the software's long-term viability and resilience.
- Happy developers, happy users: When developers can work on code they understand and enjoy, they're more productive and engaged, leading to better software and, ultimately, happier users.

Welcome to the Jungle

Imagine a small garden, teeming with life and beauty. This is your software codebase, initially small and manageable. As features accumulate and functionality grows, the garden turns into an ever-expanding jungle. Vines of connections intertwine, and dense layers of logic sprout. Complexity, like the jungle, becomes inevitable. But just as skilled explorers can navigate the jungle, understanding its hidden pathways and navigating its obstacles, so too can developers manage code complexity. Again, if careless decisions are made in the jungle, we may endanger ourselves or make our lives miserable. Here are a few things that we can do in the jungle, being aware of what can go wrong:

Clearing Paths

Refactoring acts like pruning overgrown sections, removing unnecessary code, and streamlining logical flows. This creates well-defined paths, making it easier to traverse the code jungle. However, careless actions can make the situation worse. Overzealous pruning with refactoring might sever crucial connections, creating dead ends and further confusion. Clearing paths needs precision and careful consideration of which paths we need and why.

Building Bridges

Design patterns can serve as metaphorical bridges, spanning complex sections and providing clear, standardized ways to access different functionalities. They offer familiar structures within the intricate wilderness. Beware, though, that building bridges with ill-suited or ill-implemented design patterns can lead to convoluted detours and hinder efficient navigation. Building bridges requires understanding what needs to be bridged, why, and how.

Mapping the Terrain

Documentation acts as a detailed map, charting the relationships between different parts of the code. By documenting code clearly, developers have a reference point to navigate the ever-growing jungle. Keep in mind that vague and incomplete documentation becomes a useless map, leaving developers lost in the wilderness. Mapping the terrain demands accuracy and attention to detail.

Controlling Growth

While the jungle may expand, strategic planning helps manage its complexity. Using modularization, like dividing the jungle into distinct biomes, keeps different functionalities organized and prevents tangled messes. Uncontrolled growth due to poor modularization may result in code that is impossible to maintain. Controlling growth necessitates strategic foresight.

By approaching these tasks with diligence, developers can ensure the code jungle remains explorable, understandable, and maintainable. With tools, mechanisms, and strategies tailored to our specific context and needs, developers can navigate the inevitable complexity. Now, think about the satisfaction of emerging from the dense jungle, having not just tamed it, but having used its complexities to your advantage. That's the true power of managing code complexity in software development.

Wrapping Up

While completely eliminating complexity might be unrealistic, we can significantly reduce its rate of growth and actively manage it through deliberate practices and thoughtful architecture. Ultimately, the goal is to strike a balance between functionality and maintainability. While complexity is unavoidable, it's crucial to implement strategies that prevent it from becoming an obstacle in software development.
Estimating workloads is a crucial part of mastering software development. Estimation may happen as an ongoing activity within agile teams or, among other scenarios, as a cost estimate produced in response to a tender before a migration. The team responsible for producing the estimate regularly faces a considerable workload, and the exercise can consume significant time if the costing is not conducted with an appropriate methodology. The figures produced may differ significantly depending on the efficiency of the technique employed, and misconceptions also exist about what such estimates are valid for and to what extent. This paper presents a novel hybrid method for software cost estimation that discretizes software into smaller tasks and uses both expert judgment and algorithmic techniques. By using a two-factor qualification system based on volumetry and complexity, we present a more adaptive and scalable model for estimating software project duration, with particular emphasis on large legacy migration projects.

Table of Contents

1. Introduction
2. Survey of Existing SCE
   2.1. Algorithmic Methods
   2.2. Non-Algorithmic Methods
   2.3. AI-Based Methods
   2.4. Agile Estimation Techniques
3. Hybrid Model Approach
   3.1. Task Discretization
   3.2. Dual-Factor Qualification System and Effort Calculation
   3.3. Abacus System
4. Specific Use Case in Large Legacy Migration Projects
   4.1. Importance of SCE in Legacy Migration
   4.2. Application of the Hybrid Model
   4.3. Results and Findings
5. Conclusion

Introduction

Software Cost Estimation (SCE) is a systematic and quantitative process within the field of software engineering that involves analyzing, predicting, and allocating the financial, temporal, and resource investments required for the development, maintenance, and management of software systems. This vital effort uses different methods, models, and techniques to offer stakeholders knowledgeable evaluations of the expected financial, time, and resource requirements for successful software project execution. It is an essential part of project planning, allowing for a logical distribution of resources and supporting risk assessment and management during the software development life cycle.

Survey of Existing SCE

Algorithmic Methods

COCOMO

Within the field of software engineering and cost estimation, the Constructive Cost Model, commonly referred to as COCOMO, is a well-established and highly regarded concept. Developed by Dr. Barry Boehm, COCOMO examines the interplay between software attributes and development costs. The model operates on a hierarchy of levels, ranging from basic to detailed, with each level providing varying degrees of granularity [1]. The model carefully uses factors such as lines of code and other project details, aligning them with empirical cost estimation data. Nonetheless, COCOMO is not a stagnant vestige of the past. It has progressed over the years, with COCOMO II encompassing the intricacies of contemporary software development practices, notably amid constantly evolving paradigms like object-oriented programming and agile methodologies [2]. However, though COCOMO’s empirical and methodical approach provides credibility, its use of lines of code as a primary metric attracts criticism. This is particularly true for projects where functional attributes are of greater importance.

Function Point Analysis (FPA)

Navigating away from the strict confines of code metrics, Function Point Analysis (FPA) emerges as a holistic method for evaluating software from a functional perspective.
Introduced by Allan Albrecht at IBM in the late 1970s, FPA aims to measure software by its functionality and the value it provides to users, rather than the number of lines of code. By categorizing and evaluating different user features — such as inputs, outputs, inquiries, and interfaces — FPA simplifies software complexity into measurable function points [3]. This methodology is particularly effective in projects where the functional output is of greater importance than the underlying code. FPA, which takes a user-focused approach, aligns well with customer demands and offers a concrete metric that appeals to developers and stakeholders alike. However, it is important to note that the effectiveness of FPA depends on a thorough comprehension of user needs, and uncertainties could lead to discrepancies in estimation.

SLIM (Software Life Cycle Management)

Rooted in the philosophy of probabilistic modeling, SLIM — an acronym for Software Life Cycle Management — is a multifaceted tool designed by Lawrence Putnam [4]. SLIM’s essence revolves around a set of non-linear equations that, when woven together, trace the trajectory of software development projects from inception to completion. Leveraging a combination of historical data and project specifics, SLIM presents a probabilistic landscape that provides insights regarding project timelines, costs, and potential risks. What distinguishes SLIM is its capability to adapt and reconfigure as projects progress. By persistently absorbing project feedback, SLIM dynamically refines its estimates to ensure they remain grounded in project actualities. This continuous recalibration is both SLIM’s greatest asset and its primary obstacle. While it provides flexible adaptability, it also requires detailed data recording and tracking, which demands a disciplined approach from project teams.

Non-Algorithmic Methods

Expert Judgment

Treading the venerable corridors of software estimation methodologies, one cannot overlook the enduring wisdom of Expert Judgment [5]. Avoiding the rigorous algorithms and formalities of other techniques, Expert Judgment instead draws upon the accumulated experience and intuitive prowess of industry veterans. These experienced practitioners, with their wealth of insights gathered from a multitude of projects, have an innate ability to assess the scope, intricacy, and possible difficulties of new ventures. Their nuanced comprehension can bridge gaps left by more strictly data-driven models. Expert Judgment artfully captures the intangible subtleties of a project, encapsulating the software development craft in ways quantitative metrics may overlook. However, like any art form, Expert Judgment is subject to the quirks of its practitioners. It is vulnerable to personal biases and the innate variability of human judgment.

Analogous Estimation (or Historical Data)

Historical Data estimation, also known as Analogous Estimation, is a technique that informs estimates for future projects by reviewing past ones. It is akin to gazing in the rearview mirror to navigate the path ahead. This method involves extrapolating the experiences and outcomes of similar previous projects and comparing them to the current one. By doing so, it provides a grounded perspective tempered by real-world outcomes to inform estimates. Its effectiveness rests on its empirical grounding, with past events often offering reliable predictors for future undertakings. Nevertheless, the quality and relevance of the historical data at hand are crucial factors.
A mismatched comparison or outdated data can lead projects astray, underscoring the importance of careful data curation and prudent implementation [6].

Delphi Technique

The method draws its name from the ancient Oracle of Delphi, and it orchestrates a harmonious confluence of experts. The Delphi Technique is a method that aims to reach a consensus by gathering anonymous insights and projections from a group of experts [7]. This approach facilitates a symposium of collective wisdom rather than relying on a singular perspective. Through iterative rounds of feedback, the estimates are refined and recalibrated based on the collective input. The Delphi Technique is a structured yet dynamic process that filters out outliers and converges towards a more balanced, collective judgment. It is iterative in nature and emphasizes anonymity to curtail the potential pitfalls of groupthink and influential biases. This offers a milieu where each expert’s voice finds its rightful resonance. However, the Delphi Technique requires meticulous facilitation and patience, as it journeys through multiple rounds of deliberation before arriving at a consensus.

AI-Based Methods

Machine Learning in SCE

Within the rapidly evolving landscape of Software Cost Estimation, Machine Learning (ML) emerges as a formidable harbinger of change [8]. Unshackling from the deterministic confines of traditional methods, ML delves into probabilistic realms, harnessing vast swaths of historical data to unearth hidden patterns and correlations. By training on diverse project datasets, ML algorithms refine their predictive prowess, adapting to nuances often overlooked by rigid, rule-based systems. This adaptability positions ML as a particularly potent tool in dynamic software ecosystems, where project scopes and challenges continually morph. However, the effectiveness of ML in SCE hinges on the quality and comprehensiveness of the training data. Sparse or biased datasets can lead the algorithms astray, underlining the importance of robust data curation and validation.

Neural Networks

Venturing deeper into the intricate neural pathways of computational modeling, Neural Networks (NN) stand as a testament to the biomimetic aspirations of artificial intelligence. Structured to mimic the neuronal intricacies of the human brain, NNs deploy layered architectures of nodes and connections to process and interpret information. In the realm of Software Cost Estimation, Neural Networks weave intricate patterns from historical data, capturing nonlinear relationships often elusive to traditional models [9], [10]. Their capacity for deep learning, especially with the advent of multi-layered architectures, holds immense promise for SCE’s complex datasets. Yet, the very depth that lends NNs their power can sometimes shroud them in opacity. Their "black box" nature, combined with susceptibility to overfitting, necessitates meticulous training and validation to ensure reliable estimations. Also, the recent discovery of "Grokking" suggests that this field could yield fascinating new findings [11].

Genetic Algorithms

Drawing inspiration from the very fabric of life, Genetic Algorithms (GAs) transpose the principles of evolution onto computational canvases. GAs approach Software Cost Estimation as an optimization puzzle, seeking the fittest solutions through processes mimicking natural selection, crossover, and mutation.
By initiating with a diverse population of estimation strategies and iteratively refining them through evolutionary cycles, GAs converge towards more optimal estimation models. Their inherent adaptability and explorative nature make them well-suited for SCE landscapes riddled with local optima [12]. However, the stochastic essence of GAs means that their results, while generally robust, may not always guarantee absolute consistency across runs. Calibration of their evolutionary parameters remains crucial to strike a balance between exploration and exploitation.

Agile Estimation Techniques

Agile methodologies, originally formulated to address the challenges of traditional software development processes, introduced a paradigm shift in how projects are managed and products are delivered. Integral to this approach is the iterative nature of development and the emphasis on collaboration among cross-functional teams. This collaborative approach extends to the estimation processes in Agile. Instead of trying to foresee the entirety of a project’s complexity at its outset, Agile estimation techniques are designed to evolve, adapting as the team gathers more information.

Story Points

Instead of estimating tasks in hours or days, many Agile teams use story points to estimate the relative effort required for user stories. Story points consider the complexity, risk, and effort of the task. By focusing on relative effort rather than absolute time, teams avoid the pitfalls of under- or over-estimating due to unforeseen challenges or dependencies. Over several iterations, teams develop a sense of their "velocity" — the average number of story points they complete in an iteration — which aids in forecasting [13].

Planning Poker

One of the most popular Agile estimation techniques is Planning Poker. Team members, often inclusive of developers, testers, and product owners, collaboratively estimate the effort required for specific tasks or user stories. Using a set of cards with pre-defined values (often Fibonacci sequence numbers), each member selects a card representing their estimate. After revealing their cards simultaneously, discrepancies in estimates are discussed, leading to consensus [14], [15]. The beauty of Planning Poker lies in its ability to combine individual expert opinions and arrive at an estimate that reflects the collective wisdom of the entire team. The process also uncovers potential challenges or uncertainties, leading to more informed decision-making.

Continuous Reevaluation

A hallmark of Agile estimation is its iterative nature. As teams proceed through sprints or iterations, they continually reassess and adjust their estimates based on new learnings and the actual effort expended in previous cycles. This iterative feedback loop allows for more accurate forecasting as the project progresses [14].

Hybrid Model Approach

We aim to present a novel model that incorporates both expert judgment and algorithmic approaches. While considering the expert approach, it is worth noting that it may involve subjective evaluations, possibly exhibiting inconsistencies amongst different experts. Besides, its dependence on the experience and availability of experts has the potential to introduce biases due to cognitive heuristics and over-reliance on recent experiences. On the other hand, an algorithmic approach may require significant expertise to be applied correctly and may focus on certain parameters, such as the number of lines of code, which may not be relevant.
Therefore, the aim here is to propose a model that is independent of the programming language and considers multiple factors, such as project, hardware, and personnel attributes.

Task Discretization

In the constantly evolving field of software engineering, the practice of task discretization has become firmly established as a mainstay [16]. This approach stresses the importance of breaking down larger software objectives into manageable, bite-sized units. By acknowledging the inherent discreteness of software components — from screens and APIs to SQL scripts — a methodical breakdown emerges as a practical requirement [17]. Such an approach allows you to define your software as consistent modules, composed of consistent elements. It is crucial to have homogeneous elements for estimation, so that the estimation team can easily understand what they are estimating and avoid having to adapt to each element. Those elements will be referred to as "tasks" throughout the paper.

This method of discretization has several advantages. Addressing tasks at an individual level enhances accuracy, while the granularity it brings promotes flexibility, enabling iterative adjustments that accommodate a project’s fluid requirements. It also guarantees that each component’s distinct characteristics are appropriately and independently considered. Furthermore, a detailed comprehension of each task’s complexities enables the prudent allocation of resources and the fine-tuning of skills to where they are most necessary. Nevertheless, this level of detail is not without drawbacks. Despite providing accuracy, deconstructing tasks into their constituent parts may result in administrative challenges, particularly in extensive projects. The possibility of neglecting certain tasks, albeit minor, is ever-present. Moreover, an excessively detailed approach can sometimes obscure wider project aims, resulting in decision-making delays, often referred to as "analysis paralysis".

Dual-Factor Qualification System and Effort Calculation

With the delineation of tasks exhibiting homogeneous attributes, it becomes imperative to pinpoint generic determinants for allocating appropriate effort. Upon meticulous scrutiny, two pivotal factors have been discerned: Complexity and Volumetry [1], [18].

Complexity serves as a metric to gauge the requisite technical acumen for task execution. For instance, within a user interface, the incorporation of a dynamic table may warrant classifying the task as highly complex due to its intricate requirements.

Volumetry delineates the volume or quantum of work involved. To illustrate, in the context of a user interface, an extensive forty-field form might indicate a task with significant volumetry due to the sheer magnitude of its components.

Both Complexity and Volumetry take integer values in the interval [1, 5]. We now define the Effort (E), which is calculated as:

E = C × V

where C is the Complexity and V is the Volumetry. We use multiplication in this calculation in order to establish a connection between high Complexity and high Volumetry.
This enables us to account for potential risks when both evaluation criteria increase simultaneously, while maintaining accuracy for tasks with lower coefficients. By taking the simple product of the two ranges of C and V, we obtain the following possible values for E:

[1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 20, 25]

Abacus System

Now that an effort value has been obtained, the corresponding number of days can be identified for each effort value. This stage is critical and requires the intervention of an expert with knowledge of the target architecture and technologies. However, the model permits this crucial resource to intervene only once, when establishing these values.

Use of an Algorithm

To establish these values, we propose using an algorithm to enhance accuracy and prevent errors. It can be utilized to simulate data sets using three distinct models and two starting criteria:

- The maximal number of days (which is linked with an effort of 25)
- The gap padding between values

We utilized three distinct models to enable the experts and estimation team to select from different curve profiles that may yield varied characteristics, such as precision, risk assessment, and padding size, for ideal adaptation to the requirements. Three distinct mathematical models were hypothesized to explicate the relationship: linear, quadratic, and exponential. Each model postulates a unique behavior of the effort-to-days transformation:

- The Linear Model postulates a direct proportionality between effort and days.
- The Quadratic Model envisages an accelerated growth rate, invoking polynomial mathematics.
- The Exponential Model projects an exponential surge, signifying steep escalation for higher effort values.

These models can be adjusted to more accurately meet estimation requirements.
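As a quick sanity check on the effort values listed above, here is a minimal Python snippet (an illustration, not part of the paper's listing) that enumerates the attainable effort values under the assumption that C and V are integers in [1, 5]:

```python
# Enumerate every attainable effort value E = C * V for integer C, V in [1, 5].
attainable_efforts = sorted({c * v for c in range(1, 6) for v in range(1, 6)})
print(attainable_efforts)       # [1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 20, 25]
print(len(attainable_efforts))  # 14 distinct values
```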
Finally, we obtain the following code:

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd  # pandas for tabular display

# Fixed effort values
efforts = np.array([1, 2, 3, 4, 5, 6, 8, 9, 10, 12, 15, 16, 20, 25])

# Parameters
Max_days = 25
Step_Days = 0.25

def linear_model(effort, max_effort, max_days, step_days):
    slope = (max_days - step_days) / max_effort
    return slope * effort + step_days - slope

def quadratic_model(effort, max_effort, max_days, step_days):
    scale = (max_days - step_days) / (max_effort + 0.05 * max_effort**2)
    return scale * (effort + 0.05 * effort**2)

def exponential_model(effort, max_effort, max_days, step_days):
    adjusted_max_days = max_days - step_days + 1
    base = np.exp(np.log(adjusted_max_days) / max_effort)
    return step_days + base ** effort - 1

def logarithmic_model(effort, max_effort, max_days, step_days):
    scale = (max_days - step_days) / np.log(max_effort + 1)
    return scale * np.log(effort + 1)

# Rounding to nearest step
def round_to_step(value, step):
    return round(value / step) * step

linear_days = np.array([round_to_step(linear_model(e, efforts[-1], Max_days, Step_Days), Step_Days) for e in efforts])
quadratic_days = np.array([round_to_step(quadratic_model(e, efforts[-1], Max_days, Step_Days), Step_Days) for e in efforts])
exponential_days = np.array([round_to_step(exponential_model(e, efforts[-1], Max_days, Step_Days), Step_Days) for e in efforts])
logarithmic_days = np.array([round_to_step(logarithmic_model(e, efforts[-1], Max_days, Step_Days), Step_Days) for e in efforts])

# Plot
plt.figure(figsize=(10, 6))
plt.plot(efforts, linear_days, label="Linear Model", marker='o')
plt.plot(efforts, quadratic_days, label="Quadratic Model", marker='x')
plt.plot(efforts, exponential_days, label="Exponential Model", marker='.')
plt.plot(efforts, logarithmic_days, label="Logarithmic Model", marker='+')
plt.xlabel("Effort")
plt.ylabel("Days")
plt.title("Effort to Days Estimation Models")
plt.legend()
plt.grid(True)
plt.show()

# Displaying data in table format
df = pd.DataFrame({
    'Effort': efforts,
    'Linear Model (Days)': linear_days,
    'Quadratic Model (Days)': quadratic_days,
    'Exponential Model (Days)': exponential_days,
    'Logarithmic Model (Days)': logarithmic_days
})
print(df.to_string(index=False))  # use display(df) instead when running in a notebook
```

Listing 1. Days generation model code, Python

Simulations

Let us now examine a practical example of chart generation. As previously stated in the code, the essential parameters "Step_Days" and "Max_days" have been set to 0.25 and 25, respectively. The results generated by these models using these parameters are presented below.

Figure 1: Effort to days estimation models — data

Below is a graphical representation of these results:

Figure 2: Effort to days estimation models — graphical representation

The graph lets us distinguish the variation in "compression" among the models, which yields distinct characteristics, such as accuracy for small effort values or a stronger association among values.

Specific Use Case in Large Legacy Migration Projects

Now that the model has been described, a specific application will be proposed in the context of a migration project. It is believed that this model is well-suited to projects of this kind, where teams are confronted with a situation that appears unsuited to the existing standard models, as explained in the first part.

Importance of SCE in Legacy Migration

Often, migration projects are influenced by their cost.
The need to migrate is typically caused by factors including:

- More frequent regressions and side effects
- Difficulty in locating new resources for outdated technologies
- Concentration of specialist knowledge
- Complexity in integrating new features
- Performance issues

All of the potential causes listed above increase cost and/or risk, and it may become necessary to consider migrating the problematic technological building block(s). Whether to proceed depends mainly on the cost incurred, necessitating an accurate estimate [19]. However, it is important to acknowledge that during an organization’s migration process, technical changes must be accompanied by human and organizational adjustments. Frequently, after defining the target architecture and technologies, the organization might lack the necessary experts in these fields. This can complicate the "Expert Judgement" approach. Algorithmic approaches do not appear suitable either: they require knowledge and mastery to apply, and they do not necessarily consider all the subtleties that migrations entail in terms of redrawing the components to be migrated. Additionally, the number of initial lines of code is not consistently a reliable criterion. Finally, AI-based methodologies still seem to be in their formative stages and may be challenging for these organizations to implement and master. That is why our model appears suitable: it enables the present teams to quantify the effort and then seek advice from an expert in the target technologies to create the estimate, thus obtaining an accurate figure. It is worth noting that this estimation merely encompasses the development itself and disregards the specification stages and associated infrastructure costs.

Application of the Hybrid Model

We shall outline the entire procedure for implementing our estimation model. The process comprises three phases: Initialization, Estimation, and Finalization.

Initialization

During this phase, the technology building block to be estimated must first be deconstructed. It needs to be broken down into sets of unified tasks. For example, an application with a GWT front-end calling an AS400 database could be broken down into two main sets:

- Frontend: tasks are represented by screens.
- Backend: tasks are represented by APIs.

We can then put together the estimation team. It does not need to include a technical expert in the target technology; it should be made up of resources from the existing project, preferably a technical/functional pair, so that each task can be assessed from both perspectives. This team can then start listing the tasks for the main sets identified during the discretization process.

Estimation

We now have a team ready to assign Complexity and Volumetry values to the set of tasks identified. In parallel with this association work, we can begin to set the values for the days to be associated with each effort. This work may require an expert in the target technologies as well as members of the estimation team, who quantify some benchmark values on the basis of which the expert can take a critical look and extend the results to the whole chart. At the end of this phase, we have a days/effort correspondence abacus and a list of tasks, each with an associated effort value.

Finalization

The final step is to calculate the conversion between effort and days using the abacus to obtain a total number of days.
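As a concrete illustration of this finalization step, here is a minimal Python sketch. The abacus values and the task list are hypothetical (invented for illustration, not taken from the paper); only the structure of the calculation follows the model described above.

```python
# Hypothetical abacus: effort value -> estimated days (illustrative figures only).
abacus = {1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0, 5: 1.5, 6: 1.75, 8: 2.5, 9: 3.0,
          10: 3.5, 12: 4.5, 15: 6.0, 16: 6.5, 20: 9.0, 25: 12.0}

# Hypothetical task list: (task name, Complexity, Volumetry), both integers in [1, 5].
tasks = [
    ("Login screen", 2, 3),
    ("Customer search screen", 3, 4),
    ("Order API", 4, 2),
    ("Invoice batch export", 5, 5),
]

efforts = [c * v for _, c, v in tasks]        # E = C * V for each task
total_days = sum(abacus[e] for e in efforts)  # abacus conversion, then sum

for (name, c, v), e in zip(tasks, efforts):
    print(f"{name}: C={c}, V={v}, E={e}, days={abacus[e]}")
print(f"Total estimated days: {total_days}")
```

The resulting list of effort values is also the input for the risk analysis described next.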
Once the list of effort values has been obtained, a risk analysis canbe carried out using the following criteria: The standard deviation of the probability density curve of efforts Analysis of whether certain ”zones” of the components concentrate high effort values The number of tasks with an effort value greater than 16 Depending on these criteria, specific measures can be taken in restricted areas. Results and Findings Finally, we arrive at the following process, which provides a hybrid formalization between expert judgment and algorithmic analysis. The method seems particularly well suited to the needs of migration projects, drawing on accessible resources and not requiring a high level of expertise. Figure 3: Complete process of the hybrid model Another representation, based on the nature of elements, could be the following: Figure 4: Complete process of the hybrid model Conclusion In conclusion, our model presents a practical and flexible approach for estimating the costs involved in large legacy migration projects. By combining elements of expert judgment with a structured, algorithmic analysis, this model addresses the unique challenges that come with migrating outdated or complex systems. It recognizes the importance of accurately gauging the effort and costs, considering not just the technical aspects but also the human and organizational shifts required. The three-phase process — Initialization, Estimation, and Finalization — ensures a comprehensive evaluation, from breaking down the project into manageable tasks to conducting a detailed risk analysis. This hybrid model is especially beneficial for teams facing the daunting task of migration, providing a pathway to make informed decisions and prepare effectively for the transition. Through this approach, organizations can navigate the intricacies of migration, ensuring a smoother transition to modern, more efficient systems. In light of the presented discussions and findings, it becomes evident that legacy migration projects present a unique set of challenges that can’t be addressed by conventional software cost estimation methods alone. The hybrid model as proposed serves as a promising bridge between the more heuristic expert judgment approach and the more structured algorithmic analysis, offering a balanced and adaptive solution. The primary strength of this model lies in its adaptability and its capacity to leverage both institutional knowledge and specific expertise in target technologies. Furthermore, the model’s ability to deconstruct a problem into sets of unified tasks and estimate with an appropriate level of granularity ensures its relevance across a variety of application scenarios. While the current implementation of the hybrid model shows potential, future research and improvements can drive its utility even further: Empirical validation: As with all models, empirical validation on a diverse set of migration projects is crucial. This would not only validate its effectiveness but also refine its accuracy.(We are already working on it.) Integration with AI: Although AI-based methodologies for software cost estimation are still nascent, their potential cannot be overlooked. Future iterations of the hybrid model could integrate machine learning for enhanced predictions, especially when large datasets from past projects are available. Improved risk analysis: The proposed risk analysis criteria provide a solid starting point. 
However, more sophisticated risk models, which factor in unforeseen complexities and uncertainties inherent to migration projects, could be integrated into the model.
Tooling and automation: Developing tools that can semi-automate the described process would make the model more accessible and easier for organizations to adopt.

In summary, the hybrid model represents a notable advancement in the realm of software cost estimation, especially for legacy migration projects. However, as with all models, it is an evolving entity, and continued refinement will only enhance its applicability and effectiveness.

References

[1] Barry W. Boehm. Software engineering economics. IEEE Transactions on Software Engineering, SE-7(1):4–21, 1981.
[2] Barry W. Boehm, Chris Abts, A. Winsor Brown, Sunita Chulani, Bradford K. Clark, Ellis Horowitz, Ray Madachy, Donald J. Reifer, and Bert Steece. Cost models for future software life cycle processes: COCOMO 2.0. Annals of Software Engineering, 1(1):57–94, 2000.
[3] International Function Point Users Group (IFPUG). Function Point Counting Practices Manual. IFPUG, 2000.
[4] L. H. Putnam. A general empirical solution to the macro software sizing and estimating problem. IEEE Transactions on Software Engineering, 4:345–361, 1978.
[5] R. T. Hughes. Expert judgment as an estimating method. Information and Software Technology, 38(2):67–75, 1996.
[6] Christopher Rush and Rajkumar Roy. Expert judgment in cost estimating: Modelling the reasoning process.
[7] N. Dalkey. An experimental study of group opinion: the Delphi method. Futures, 1(5):408–426, 1969.
[8] Yibeltal Assefa, Fekerte Berhanu, Asnakech Tilahun, and Esubalew Alemneh. Software effort estimation using machine learning algorithm. In 2022 International Conference on Information and Communication Technology for Development for Africa (ICT4DA), pages 163–168, 2022.
[9] A. Venkatachalam. Software cost estimation using artificial neural networks. In Proc. Int. Conf. Neural Netw. (IJCNN-93-Nagoya, Japan), volume 1, pages 987–990, Oct 1993.
[10] R. Poonam and S. Jain. Enhanced software effort estimation using multi-layered feed forward artificial neural network technique. Procedia Computer Science, 89:307–312, 2016.
[11] Alethea Power, Yuri Burda, Harri Edwards, et al. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.
[12] B. K. Singh and A. K. Misra. Software effort estimation by genetic algorithm tuned parameters of modified constructive cost model for NASA software projects. International Journal of Computer Applications, 59:22–26, 2012.
[13] K. Hrvoje and S. Gotovac. Estimating software development effort using Bayesian networks. In 2015 23rd International Conference on Software, Telecommunications and Computer Networks, pages 229–233, Split, Croatia, September 16–18, 2015.
[14] M. Cohn. Agile Estimating and Planning. Prentice Hall PTR, 2005.
[15] Saurabh Bilgaiyan, Santwana Sagnika, Samaresh Mishra, and Madhabananda Das. A systematic review on software cost estimation in agile software development. Journal of Engineering Science and Technology Review, 10(4):51–64, 2017.
[16] S. McConnell. Software Estimation: Demystifying the Black Art. Microsoft Press, 2006.
[17] C. Szyperski. Component Software: Beyond Object-Oriented Programming. Addison-Wesley, 2nd edition, 2002.
[18] N. E. Fenton and S. L. Pfleeger. Software Metrics: A Rigorous and Practical Approach. PWS Publishing Co., 1997.
[19] Harry M. Sneed and Chris Verhoef. Cost-driven software migration: An experience report. Software: Practice and Experience, 2020.
In this fascinating talk, Michael Lloyd introduced the concept of dysfunction mapping, a tool developed over years of trial and error aimed at creating a repeatable way to find, theme, and ultimately solve organizational dysfunction.

Abstract

Dysfunction mapping is a tool developed over years of trial and error, aimed at creating a repeatable way to find, theme, and ultimately solve organizational dysfunction. By following these steps, you can more quickly identify the biggest wins, develop a solid action plan, and measure if you're really achieving outcomes that matter. It's not a silver bullet, but it can give you some structure to creatively solve problems while also making your value visible and your goals clear.

During the Q&A part on dysfunction mapping, Michael answered the following questions, among others:

What about other common iterative change management strategies? Which have you tried, and why have they failed for you? Why do you think they failed in your experience?
Curious why the mapping moves from right to left.
A system diagram of influencing factors often contains reinforcing loops; how do we make that work with dysfunction trees that do not allow for loops?
Do you worry when you have identified a "dysfunction" that you could go deeper into and perhaps get more possibilities? I am thinking of "The Dangers of the 5 Whys" and the advantages of systems thinking instead of reductionist cause-effect models.
This seems focused on practices/processes. What is your approach to the foundational issues of mindset, values, behavior, and culture?
Are the symptom cards available on the Honest Agile website?
How do you define a purpose statement for dysfunctions? How do we get the team's buy-in for these statements?
How do you prioritize what to pick up first/next?
I am curious about the clustering of symptoms and finding the real causes. Is it a typical root cause analysis?
Do you identify impacts of the symptoms/dysfunctions or do anything else to understand how to prioritize which to address? Understanding the impact seems to be a missing piece.
Are you sharing all your written-down symptoms with the team? "This all is what you are doing wrong…"
The mindset-values-principles-practices slide could be interpreted as there is only one mindset.
Do you think the Scrum anti-patterns Stefan identified could automatically fill your diagram?
How do we order the implementation of the solutions?
Is it, in your opinion, reasonable to include the DM in the OKRs?
Symptom metrics are mostly output metrics? Could we also measure against the root cause?
How do you create a single Sprint/cycle goal when doing Kanban with, e.g., 4 product goals?

Video: Dysfunction Mapping With Michael Lloyd

Meet Michael Lloyd

Michael serves as a distinguished Agile Coach, Scrum Master, and authentic leader, boasting eight years of expertise in enhancing team and organizational performance to increase value delivery frequency. As the Head of Global Agility at Honest Agile, his mission is to influence agile practices globally by assisting agile practitioners in addressing actual challenges. Connect with Michael Lloyd on LinkedIn.
Platform engineering is the creation and management of foundational infrastructure and automated processes, incorporating principles like abstraction, automation, and self-service, to empower development teams, optimize resource utilization, ensure security, and foster collaboration for efficient and scalable software development. In today's fast-paced world of software development, the evolution of "platform engineering" stands as a transformative force, reshaping the landscape of software creation and management. This comprehensive exploration aims to demystify the intricate realm of platform engineering, shedding light on its fundamental principles, multifaceted functions, and its pivotal role in revolutionizing streamlined development processes across industries.

Key Concepts and Principles

Platform engineering encompasses several key concepts and principles that underpin the design and implementation of internal platforms. One fundamental concept is abstraction, which involves shielding developers from the complexities of underlying infrastructure through well-defined interfaces. Automation is another crucial principle, emphasizing the use of scripting and tools to streamline repetitive tasks, enhance efficiency, and maintain consistency in development processes. Self-service is pivotal, empowering development teams to independently provision and manage resources. Scalability ensures that platforms can efficiently adapt to varying workloads, while resilience focuses on the system's ability to recover from failures. Modularity encourages breaking down complex systems into independent components, fostering flexibility and reusability. Consistency promotes uniformity in deployment and configuration, aiding troubleshooting and stability. API-first design prioritizes the development of robust interfaces, and observability ensures real-time monitoring and traceability. Lastly, security by design emphasizes integrating security measures throughout the entire development lifecycle, reinforcing the importance of a proactive approach to cybersecurity. Together, these concepts and principles guide the creation of robust, scalable, and developer-friendly internal platforms, aligning with the evolving needs of modern software development.

Diving Into the Role of a Platform Engineering Team

The platform engineering team operates at the intersection of software development, operational efficiency, and infrastructure management. Their primary objective revolves around sculpting scalable and efficient internal platforms that empower developers. Leveraging automation, orchestration, and innovative tooling, these teams create standardized environments for application deployment and management, catalyzing productivity and performance. Elaborating further on the team's responsibilities, it's essential to highlight their continuous efforts in optimizing resource utilization, ensuring security and compliance, and establishing robust monitoring and logging mechanisms. Their role extends beyond infrastructure provisioning, encompassing the facilitation of collaboration among development, operations, and security teams to achieve a cohesive and agile software development ecosystem.

Building Blocks of Internal Platforms

Central to platform engineering is the concept of an Internal Developer Platform (IDP): a tailored environment equipped with an array of tools, services, and APIs.
This environment streamlines the development lifecycle, offering self-service capabilities that enable developers to expedite the build, test, deployment, and monitoring of applications. Internal platforms in the context of platform engineering encompass various components that work together to provide a unified and efficient environment for the development, deployment, and management of applications. The specific components may vary depending on the platform's design and purpose, but here are some common components:

Infrastructure as Code (IaC)
Containerization and orchestration
Service mesh
API gateway
CI/CD pipelines
Monitoring and logging
Security components
Database and data storage
Configuration management
Workflow orchestration
Developer tools
Policy and governance

Benefits of Internal Platforms

Internal platforms in platform engineering offer a plethora of benefits, transforming the software development landscape within organizations. These platforms streamline and accelerate the development process by providing self-service capabilities, enabling teams to independently provision resources and reducing dependencies on dedicated operations teams. Automation through CI/CD pipelines enhances efficiency and ensures consistent, error-free deployments. Internal platforms promote scalability, allowing organizations to adapt to changing workloads and demands. The modularity of these platforms facilitates code reusability, reducing development time and effort. By abstracting underlying infrastructure complexities, internal platforms empower developers to focus on building applications rather than managing infrastructure. Collaboration is enhanced through centralized tools, fostering communication and knowledge sharing. Additionally, internal platforms contribute to improved system reliability, resilience, and observability, enabling organizations to deliver high-quality, secure software at a faster pace. Overall, these benefits make internal platforms indispensable for organizations aiming to stay agile and competitive in the ever-evolving landscape of modern software development.

Challenges in Platform Engineering

Platform engineering, while offering numerous benefits, presents a set of challenges that organizations must navigate. Scalability issues can arise as the demand for resources fluctuates, requiring careful design and management to ensure platforms can efficiently scale. Maintaining a balance between modularity and interdependence poses a challenge, as breaking down systems into smaller components can lead to complexity and potential integration challenges. Compatibility concerns may emerge when integrating diverse technologies, requiring meticulous planning to ensure seamless interactions. Cultural shifts within organizations may be necessary to align teams with the principles of platform engineering, and skill gaps may arise, necessitating training initiatives. Additionally, achieving consistency across distributed components and services can be challenging, impacting the reliability and predictability of the platform. Balancing security measures without hindering development speed is an ongoing challenge, and addressing these challenges demands a holistic and strategic approach to platform engineering that considers technical, organizational, and cultural aspects.
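To make the abstraction and self-service principles above concrete, the following minimal Python sketch shows the shape a thin self-service provisioning interface on an internal platform might take. The template names, resource values, and the provision helper are hypothetical; a real internal developer platform would delegate to infrastructure-as-code tooling, CI/CD pipelines, and policy checks behind an interface of roughly this shape.

```python
"""Illustrative sketch of a self-service provisioning interface.

Everything here is hypothetical: a real internal developer platform would call
infrastructure-as-code tooling, enforce governance policies, and emit audit
logs behind an interface of roughly this shape.
"""
from dataclasses import dataclass

# Pre-approved, standardized environment templates (abstraction + consistency).
TEMPLATES = {
    "web-service": {"cpu": "500m", "memory": "512Mi", "replicas": 2},
    "batch-job": {"cpu": "1", "memory": "1Gi", "replicas": 1},
}

@dataclass
class Environment:
    team: str
    name: str
    template: str
    config: dict

def provision(team: str, name: str, template: str) -> Environment:
    """Self-service entry point: developers request an environment by template,
    never by raw infrastructure details."""
    if template not in TEMPLATES:
        raise ValueError(f"Unknown template '{template}'; choose from {sorted(TEMPLATES)}")
    config = TEMPLATES[template].copy()
    # In a real platform, this is where IaC, security policies, and CI/CD wiring would run.
    print(f"[platform] provisioning '{name}' for team '{team}' using '{template}'")
    return Environment(team=team, name=name, template=template, config=config)

env = provision(team="payments", name="checkout-api", template="web-service")
print(env.config)
```

The key point is the boundary: developers see templates and a single provision call, while everything behind it (infrastructure, policies, pipelines) remains the platform team's concern.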
Implementation Strategies in Platform Engineering

Following are the top five implementation strategies:

Start small and scale gradually: Begin with a focused and manageable scope, such as a pilot project or a specific team. This allows for the identification and resolution of any initial challenges in a controlled environment. Once the initial implementation proves successful, gradually scale the platform across the organization.
Invest in training and skill development: Provide comprehensive training programs to ensure that development and operations teams are well-versed in the tools, processes, and concepts associated with platform engineering. Investing in skill development ensures that teams can effectively utilize the platform and maximize its benefits.
Automate key processes with CI/CD: Implement Continuous Integration (CI) and Continuous Deployment (CD) pipelines to automate crucial aspects of the development lifecycle, including code building, testing, and deployment. Automation accelerates development cycles, reduces errors, and enhances overall efficiency.
Cultivate DevOps practices: Embrace DevOps practices that foster collaboration and communication between development and operations teams. This approach promotes shared responsibility, collaboration, and a holistic approach to software development, aligning with the principles of platform engineering.
Iterative improvements based on feedback: Establish a feedback loop to gather insights and feedback from users and stakeholders. Regularly review performance metrics, user experiences, and any challenges faced during the implementation. Use this feedback to iteratively improve the platform, addressing issues and continuously enhancing its capabilities.

These top five strategies emphasize a phased and iterative approach, coupled with a strong focus on skill development, automation, and collaborative practices. Starting small, investing in training, and embracing a DevOps culture contribute to the successful implementation and ongoing optimization of platform engineering practices within an organization.

Platform Engineering Tools

Various tools aid platform engineering teams in building, maintaining, and optimizing platforms. Examples include:

Backstage: Developed by Spotify, it offers a unified interface for accessing essential tools and services.
Kratix: An open-source tool designed for infrastructure management and streamlining development processes.
Crossplane: An open-source tool automating infrastructure via declarative APIs, supporting tailored platform solutions.
Humanitec: A comprehensive platform engineering tool facilitating easy platform building, deployment, and management.
Port: A platform enabling the building of developer platforms with a rich software catalog and role-based access control.

Case Studies of Platform Engineering

Spotify: Spotify is known for its adoption of a platform model to empower development teams. They use a platform called "Backstage," which acts as an internal developer portal. Backstage provides a centralized location for engineers to discover, share, and reuse services, tools, and documentation. It streamlines development processes, encourages collaboration, and improves visibility into the technology stack.

Netflix: Netflix is a pioneer in adopting a microservices architecture and has developed an internal platform called the Netflix Internal Platform Engineering (NIPE). The platform enables rapid application deployment, facilitates service discovery, and incorporates fault tolerance.
Uber: Uber has implemented an internal platform called "Michelangelo" to streamline machine learning (ML) workflows. Michelangelo provides tools and infrastructure to support end-to-end ML development, from data processing to model deployment.

Salesforce: Salesforce has developed an internal platform known as "Salesforce Lightning Platform." This platform enables the creation of custom applications and integrates with the Salesforce ecosystem. It emphasizes low-code development, allowing users to build applications with minimal coding, accelerating the development process, and empowering a broader range of users.

Distinguishing Platform Engineering From SRE

While both platform engineering and Site Reliability Engineering (SRE) share goals of ensuring system reliability and scalability, they diverge in focus and approach. Platform engineering centers on crafting foundational infrastructure and tools for development, emphasizing the establishment of internal platforms that empower developers. In contrast, SRE focuses on operational excellence, managing system reliability, incident response, and ensuring the overall reliability, availability, and performance of production systems. Further Reading: Top 10 Open Source Projects for SREs and DevOps Engineers.

Factors | Platform Engineering | SRE
Scope | Focused on creating a development-friendly platform and environment. | Focused on reliability and performance of applications and services in production.
Responsibilities | Platform Engineers design and maintain internal platforms, emphasizing tools and services for development teams. | SREs focus on operational aspects, automating tasks, and ensuring the resilience and reliability of production systems.
Abstraction Level | Platform Engineering abstracts infrastructure complexities for developers, providing a high-level platform. | SRE deals with lower-level infrastructure details, ensuring the reliability of the production environment.

DevOps vs Platform Engineering

DevOps and platform engineering are distinct methodologies addressing different aspects of software development. DevOps focuses on collaboration and automation across the entire software delivery lifecycle, while platform engineering concentrates on providing a unified and standardized platform for developers. The table below outlines the differences between DevOps and platform engineering.

Factors | DevOps | Platform Engineering
Objective | Streamline development and operations | Provide a unified and standardized platform for developers
Principles | Collaboration, Automation, CI, CD | Enable collaboration, Platform as a Product, Abstraction, Standardization, Automation
Scope | Extends to the entire software delivery lifecycle | Foster collaboration between dev and ops teams, providing a consistent environment for the entire lifecycle
Tools | Uses a wide range of tools at different stages in the lifecycle | Integrates a diverse set of tools into the platform
Benefits | Faster development & deployment cycles, higher collaboration | Efficient and streamlined development environment, improved productivity, and flexibility for developers

Future Trends in Platform Engineering

Multi-cloud and hybrid platforms: Platform engineering is expected to focus on providing solutions that seamlessly integrate and manage applications across different cloud providers and on-premises environments.
Edge computing platforms: Platforms will need to address challenges related to latency, connectivity, and management of applications deployed closer to end-users.
AI-driven automation: The integration of artificial intelligence (AI) and machine learning (ML) into platform engineering is expected to increase. AI-driven automation can optimize resource allocation, improve predictive analytics for performance monitoring, and enhance security measures within platforms.
Serverless architectures: Serverless computing is anticipated to become more prevalent, leading to platform engineering solutions that support serverless architectures. This trend focuses on abstracting server management, allowing developers to focus solely on writing code.
Observability and AIOps: Observability, including monitoring, tracing, and logging, will continue to be a key focus. AIOps (Artificial Intelligence for IT Operations) will likely play a role in automating responses to incidents and predicting potential issues within platforms.
Low-code/no-code platforms: The rise of low-code/no-code platforms is likely to influence platform engineering, enabling a broader range of users to participate in application development with minimal coding. Platform engineering will need to support and integrate with these development approaches.
Quantum computing integration: As quantum computing progresses, platform engineering may need to adapt to support the unique challenges and opportunities presented by quantum applications and algorithms.
Zero Trust Security: Zero Trust Security models are becoming increasingly important. Future platform engineering will likely focus on implementing and enhancing security measures at every level, considering the principles of zero trust in infrastructure and application security.
The history of DevOps is definitely worth reading about, and a few good books cover it. On that topic, "The Phoenix Project," self-characterized as "a novel of IT and DevOps," is often mentioned as a must-read. Yet for practitioners like myself, a more hands-on one is "The DevOps Handbook" (which shares Kim as an author, in addition to Debois, Willis, and Humble), which recounts some of the watershed moments in the evolution of software engineering and provides good references around implementation. This book actually describes how to replicate the transformation explained in the Phoenix Project and provides case studies. In this brief article, I will use my notes on this great book to regurgitate a concise history of DevOps, add my personal experience and opinion, and establish a link to Cloud Development Environments (CDEs), i.e., the practice of providing access to, and running, development environments online as a service for developers. In particular, I explain how the use of CDEs concludes the effort of bringing DevOps "fully online." Explaining the benefits of this shift in development practices, plus a few personal notes, is my main contribution in this brief article. Before clarifying the link between DevOps and CDEs, let's first dig into the chain of events and technical contributions that led to today's main methodology for delivering software.

The Agile Manifesto

The creation of the Agile Manifesto in 2001 set forth values and principles as a response to more cumbersome software development methodologies like Waterfall and the Rational Unified Process (RUP). One of the manifesto's core principles emphasizes the importance of delivering working software frequently, ranging from a few weeks to a couple of months, with a preference for shorter timescales. The Agile movement's influence expanded in 2008 during the Agile Conference in Toronto, where Andrew Shafer suggested applying Agile principles to IT infrastructure rather than just to the application code. This idea was further propelled by a 2009 presentation at the Velocity Conference, where a paper from Flickr demonstrated the impressive feat of "10 deployments a day" using Dev and Ops collaboration. Inspired by these developments, Patrick Debois organized the first DevOps Days in Belgium, effectively coining the term "DevOps." This marked a significant milestone in the evolution of software development and operational practices, blending Agile's swift adaptability with a more inclusive approach to the entire IT infrastructure.

The Three Ways of DevOps and the Principles of Flow

All the concepts that I discussed so far are today embodied in the "Three Ways of DevOps," i.e., the foundational principles that guide the practices and processes in DevOps. In brief, these principles focus on:

Improving the flow of work (First Way): eliminating bottlenecks, reducing batch sizes, and accelerating the workflow from development to production;
Amplifying feedback loops (Second Way): quickly and accurately collecting information about any issues or inefficiencies in the system; and
Fostering a culture of continuous learning and experimentation (Third Way): encouraging teams to learn continuously and to experiment.

Following the leads from Lean Manufacturing and Agile, it is easy to understand what led to the definition of the above three principles. I delve more deeply into each of these principles in this conference presentation.
For the current discussion, though, i.e., how DevOps history leads to Cloud Development Environments, we just need to look at the First Way, the principle of flow, to understand the causative link. Chapter 9 of the DevOps Handbook explains that the technologies of version control and containerization are central to implementing DevOps flows and establishing a reliable and consistent development process. At the center of enabling the flow is the practice of incorporating all production artifacts into version control to serve as a single source of truth. This enables the recreation of the entire production environment in a repeatable and documented fashion. It ensures that production-like code development environments can be automatically generated and entirely self-serviced without requiring manual intervention from Operations. The significance of this approach becomes evident at release time, which is often the first time that an application's behavior is observed in a production-like setting, complete with realistic load and production data sets. To reduce the likelihood of issues, developers are encouraged to operate production-like environments on their workstations, created on-demand and self-serviced through mechanisms such as virtual images or containers, utilizing tools like Vagrant or Docker. Putting these environments under version control allows for the entire pre-production and build processes to be recreated. Note that production-like environments really refer to environments that, in addition to having the same infrastructure and application configuration as the real production environments, also contain additional applications and layers necessary for development.

Developers are encouraged to operate production-like environments (Docker icon) on their workstations using mechanisms such as virtual images or containers to reduce the likelihood of execution issues in production.

From Developer Workstations to a CDE Platform

The notion of self-service is already emphasized in the DevOps Handbook as a key enabler of the principle of flow. Using 2016 technology, this is realized by downloading environments to the developers' workstations from a registry (such as DockerHub) that provides pre-configured, production-like environments as files (dubbed infrastructure-as-code). Docker is often a tool to implement this function. Starting from this operation, developers create an application, in effect, as follows:

(1) They access and copy files with development environment information to their machines,
(2) add source code to them in local storage, and
(3) build the application locally using their workstation's computing resources.

This is illustrated in the left part of the figure below. Once the application works correctly, the source code is sent ("pushed") to a central code repository, and the application is built and deployed online, i.e., using Cloud-based resources and applications such as CI/CD pipelines. The three development steps listed above are, in effect, the only operations, in addition to the authoring of source code using an IDE, that are "local," i.e., they use the workstation's physical storage and computing resources. All the rest of the DevOps operations are performed using web-based applications and consumed as-a-service by developers and operators (even when these applications are self-hosted by the organization). The basic goal of Cloud Development Environments is to move these development steps online as well.
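Before looking at how CDE platforms substitute these steps, here is a small, illustrative sketch of the classic workstation workflow, steps (1) to (3), driven from Python via the Docker CLI. The registry image name and the build command are placeholders; the point is simply that environment retrieval, source code, and the build all consume local storage and compute.

```python
"""Illustrative sketch of the classic local workflow (steps 1-3).

The image name and build command are placeholders. Every step below uses the
developer's local disk and CPU, which is exactly what Cloud Development
Environments move online.
"""
import os
import subprocess

IMAGE = "registry.example.com/prod-like-dev-env:latest"  # hypothetical image

def run(cmd):
    """Print and execute a shell command, failing loudly on errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# (1) Fetch the pre-configured, production-like environment to the workstation.
run(["docker", "pull", IMAGE])

# (2) The source code lives in the local working directory (mounted into the container).
# (3) Build the application locally, inside the production-like environment.
run(["docker", "run", "--rm",
     "-v", f"{os.getcwd()}:/workspace", "-w", "/workspace",
     IMAGE, "make", "build"])
```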
To do that, CDE platforms, in essence, provide the following basic services, illustrated in the right part of the figure below:

(1) Manage development environments online as containers or virtual machines such that developers can access them fully built and configured, substituting step (1) above;
(2) Provide a mechanism for authoring source code online, i.e., inside the development environment using an IDE or a terminal, substituting step (2); and
(3) Provide a way to execute build commands inside the development environment (via the IDE or terminal), substituting step (3).

Figure: (left) The classic development data flow requires the use of local workstation resources. (right) The cloud development data flow replaces local storage and computing while keeping a similar developer experience. On each side, the operations are (1) accessing environment information, (2) adding code, and (3) building the application.

Note that the replacement of step (2) can be done in several ways. For example, the IDE can be browser-based (aka a Cloud IDE), or a locally installed IDE can implement a way to remotely author the code in the remote environment. It is also possible to use a console text editor, such as vim, via a terminal. I cannot conclude this discussion without mentioning that, often, multiple containerized environments are used for testing on the workstation, in particular in combination with the main containerized development environment. Hence, cloud IDE platforms need to reproduce the capability to run containerized environments inside the Cloud Development Environment (itself a containerized environment). If this recursive process becomes a bit complicated to grasp, don't worry; we have reached the end of the discussion and can move to the conclusion.

What Comes Out of Using Cloud Development Environments in DevOps

A good way to conclude this discussion is to summarize the benefits of moving development environments from the developers' workstations online using CDEs. As a result, the use of CDEs for DevOps leads to the following advantages:

Streamlined Workflow: CDEs enhance the workflow by removing data from the developer's workstation and decoupling the hardware from the development process. This ensures the development environment is consistent and not limited by local hardware constraints.
Environment Definition: With CDEs, version control becomes more robust, as it can uniformize not only the environment definition but all the tools attached to the workflow, leading to a standardized development process and consistency across teams throughout the organization.
Centralized Environments: The self-service aspect is improved by centralizing the production, maintenance, and evolution of environments based on distributed development activities. This allows developers to quickly access and manage their environments without the need for manual work from Operations.
Asset Utilization: Migrating the consumption of computing resources from local hardware to centralized and shared cloud resources not only lightens the load on local machines but also leads to more efficient use of organizational resources and potential cost savings.
Improved Collaboration: Ubiquitous access to development environments, secured by embedded security measures in the access mechanisms, allows organizations to cater to a diverse group of developers, including internal, external, and temporary workers, fostering collaboration across various teams and geographies.
Scalability and Flexibility: CDEs offer scalable cloud resources that can be adjusted to project demands, facilitating the management of multiple containerized environments for testing and development, thus supporting the distributed nature of modern software development teams.
Enhanced Security and Observability: Centralizing development environments in the Cloud not only improves security (more about secure CDEs) but also provides immediate observability due to their online nature, allowing for real-time monitoring and management of development activities.

By integrating these aspects, CDEs become a solution for modern, and in particular cloud-native, software development, aligning with the DevOps principles of improving flow, feedback, and continuous learning. In an upcoming article, I will discuss the contributions of CDEs across all three ways of DevOps. In the meantime, you're welcome to share your feedback with me.
The evolution of enterprise software engineering has been marked by a series of "less" shifts — from client-server to web and mobile ("client-less"), data center to cloud ("data-center-less"), and app server to serverless. These transitions have simplified aspects of software engineering, including deployment and operation, allowing users to focus less on the underlying systems and more on the application itself. This trend of radical simplification now leads us to the next significant shift in enterprise software engineering: moving from platforms to a "platformless" approach.

The Challenges of Platform-Based Approaches

In recent years, the rise of enterprise software delivery platforms, often built on Kubernetes and other cluster management systems, has transformed the way organizations deploy and manage applications. They enable rapid, scalable application deployment and the ability to incrementally roll out and roll back updates. This agility in improving application function and performance is vital for business success. However, this added complexity has introduced new challenges, including the need for large, highly skilled platform engineering teams and intricate links between various systems like DevOps pipelines, deployment management, monitoring systems, SecOps, and site reliability engineering (SRE). Additionally, platform engineering places a predominant emphasis on the delivery of software rather than the entire software engineering lifecycle.

The Need for a New Paradigm: Platformless

To overcome these challenges, there's a clear need for a paradigm shift. We need to move the focus away from building, maintaining, and managing platforms to a more straightforward "platformless" approach. This does not imply the elimination of platforms, but rather the creation of a boundary that makes the platform invisible to the user. In this new paradigm, the focus shifts from managing platforms to developing, building, and deploying applications with seamless integration and monitoring — but without the intricacies of platform management.

Defining Platformless

Platformless is an innovative concept combining four technology domains: API-First, Cloud-Native Middleware, Platform Engineering, and Developer Experience (DX). This blend allows for a holistic approach to enterprise software engineering, covering the entire lifecycle and delivering a truly platformless experience.

API-First: This approach emphasizes making all functionalities available as APIs, events, and data products, ensuring easy discovery and consumption. The focus here is on designing, governing, and managing APIs to ensure consistency and ease of use across the enterprise. In a platformless environment, this API-First approach is further enhanced as all network-exposed capabilities become APIs by default, streamlining governance and management, and shifting the enterprise's focus to leveraging these APIs as a comprehensive software development kit (SDK) for the business.

Cloud-Native Middleware: This component involves building and operating systems in a scalable, secure, resilient multi-cloud environment. It encompasses domain-driven design, cell-based architecture, service meshes, integrated authentication and authorization, and zero-trust architecture. Platformless architecture integrates all these components, simplifying the challenges of building and managing cloud-native infrastructure and allowing enterprises to focus more on delivering value.
Platform Engineering: This involves creating toolchains and processes for easy, self-service software building, delivery, and operation. Internal Developer Platforms (IDPs) born from this discipline support various roles in software delivery, including developers, testers, and operations teams. In a platformless context, these platforms become central to facilitating the software engineering process, allowing each party to concentrate solely on their areas of responsibility and expertise.

Developer Experience (DX): As the heart of platformless, DX focuses on making the development environment seamless and intuitive. It includes integrated development environments, command-line interfaces, well-designed web experiences, and comprehensive documentation. DX directly impacts the productivity and creativity of developers, driving better software quality, quicker market delivery, and, overall, a happier and more innovative development team.

Streamlining Enterprise Software Development and Delivery With a Platformless Approach

In enterprise software engineering, the shift to platformless significantly simplifies the development and management of large enterprise application systems that deliver digital experiences. As businesses evolve, they require an ecosystem of interconnected software products, ranging from user-facing apps to autonomous network programs. Platformless facilitates this by enabling the seamless integration of diverse digital assets across various business domains. It streamlines the creation of modular, secure, and reusable architectures, while also enhancing delivery through rapid deployment, continuous integration, and efficient management. This approach allows enterprises to focus on innovation and value delivery, free from the complexities of traditional platform-based systems. For example, with a platformless environment, a developer can integrate a company's systems of record with multiple web, mobile, and IoT applications; discover APIs; use languages or tools of their choice; and deploy application components such as APIs, integrations, and microservices in a zero-trust environment — all without managing the underlying platform. Ultimately, this leads to improved efficiency and a greater focus on problem-solving for better business results. The journey from software delivery platforms to a platformless approach represents a major leap in the evolution of enterprise application development and delivery. While retaining the benefits of scalability and rapid deployment, platformless simplifies and enhances the development experience, focusing on the applications rather than the platform. This shift not only streamlines the development process but also promises to deliver superior applications to customers — ultimately driving business innovation and growth.
In the first two parts of our series "Demystifying Event Storming," we embarked on a journey through the world of Event Storming, an innovative approach to understanding complex business domains and software systems. We started by exploring the fundamentals of Event Storming, understanding its collaborative nature and how it differs from traditional approaches. In Part 2, we delved deeper into process modeling, looking at how Event Storming helps in mapping out complex business processes and interactions. Now, in Part 3, we will focus on the design-level aspect of Event Storming. This stage is crucial for delving into the technical aspects of system architecture and design. Here, we'll explore how to identify aggregates, a key component in domain-driven design, and how they contribute to creating robust and scalable systems. This part aims to provide practical insights into refining system design and ensuring that it aligns seamlessly with business needs and objectives. Stay tuned as we continue to unravel the layers of Event Storming, providing you with the tools and knowledge to effectively apply this technique in your projects.

Understanding the Visual Model

As discussed in previous articles, the Event Storming Visual Model depicts a dynamic and interactive model for system design, highlighting the flow from real-world actions to system outputs and policies. Commands are depicted as decisive actions that trigger system operations, while events are the outcomes or results of those actions within the system; together, these concepts help guide system design. Policies serve as the guidelines or business rules that dictate how events are handled, ensuring system behavior aligns with business objectives. The read model represents the information structure affected by events, influencing future system interactions. Sketches and user inputs provide context and detail, enhancing the understanding of the system's workings. Lastly, hotspots are identified as critical areas needing scrutiny or improvement, often sparking in-depth discussions and problem-solving during an Event Storming session. This comprehensive model underpins Event Storming's utility as a collaborative tool, enabling stakeholders to collectively navigate and design complex software architectures.

Abstraction Levels

"Event Storming is used to design a set of software artifacts that enforce domain logic and business consistency." - Alberto Brandolini

In fact, Event Storming is a powerful technique used to map the intricacies of a system at varying levels of abstraction. This collaborative method enables teams to visualize and understand the flow of events, actions, and policies within a domain.

Big Picture Level

At the Big Picture Level of Event Storming, the primary goal is to establish an overarching view of the system. This stage serves as the foundation for the entire process. Participants collaborate to identify major domains or subdomains within the system, often referred to as "big picture" contexts. These contexts represent high-level functional areas or components that play essential roles in the system's operation. The purpose of this level is to provide stakeholders with a holistic understanding of the system's structure and architecture.
Sticky notes and a large canvas are used to visually represent these contexts, with each context being named and briefly described. This visualization offers clarity on the overall system landscape and helps align stakeholders on the core domains and areas of focus. During this stage, participants also focus on identifying and documenting potential conflicts within the system. Conflicts may arise due to overlapping responsibilities, resource allocation, or conflicting objectives among different domains. Recognizing these conflicts early allows teams to address them proactively, minimizing challenges during the later stages of design and development. In addition to conflicts, participants at the Big Picture Level work to define the system's goals. These goals serve as the guiding principles that drive the system's design and functionality. Clear and well-defined goals help ensure that the subsequent design decisions align with the system's intended purpose and objectives. Blockers, which are obstacles or constraints that can impede the system's progress, are another key consideration at this level. Identifying blockers early in the process enables teams to devise strategies to overcome them effectively, ensuring smoother system implementation. Conceptual boundaries define the scope and context of each domain or subdomain. Understanding these boundaries is essential for ensuring that the system operates seamlessly within its defined constraints. The Big Picture serves as a starting point for addressing these elements, allowing stakeholders to gain insights into the broader challenges and opportunities within the system. This comprehensive view not only aids in understanding the system's structure but also lays the groundwork for addressing these elements in subsequent levels of abstraction during Event Storming.

Process Level

The Process Level of Event Storming delves deeper into the specific business processes or workflows within each identified context or domain. Participants collaborate to define the sequence of events and actions that occur during these processes. The primary goal is to visualize and understand the flow of actions and events that drive the system's behavior. This level helps uncover dependencies, triggers, and outcomes within processes, providing a comprehensive view of how the system operates in response to various inputs and events. Sticky notes are extensively used to represent events and commands within processes, and the flow is mapped on the canvas. This visual representation clearly shows how events and actions connect to achieve specific objectives, offering insights into process workflows. At the Process Level, it's essential to identify the value proposition, which outlines the core benefits that the system or process delivers to its users or stakeholders. Understanding the value proposition helps participants align their efforts with the overall objectives and ensures that the designed processes contribute to delivering value. Policies represent the rules, guidelines, and business logic that govern how events are handled within the system. They define the behavior and decision-making criteria for various scenarios. Recognizing policies during Event Storming ensures that participants consider the regulatory and compliance aspects that impact the processes. Personas are fictional characters or user profiles that represent different types of system users.
These personas help in empathizing with the end-users and understanding their needs, goals, and interactions with the system. Incorporating personas into the process level enables participants to design processes that cater to specific user requirements. Individual goals refer to the objectives and intentions of various actors or participants within the system. Identifying individual goals helps in mapping out the motivations and expected outcomes of different stakeholders. It ensures that the processes align with the diverse goals of the involved parties.

Design Level

At the Design Level of Event Storming, the focus shifts to the internal behavior of individual components or aggregates within the system. Participants work together to model the commands, events, and policies that govern the behavior of these components. This level allows for a more granular exploration of system behavior, enabling participants to define the contracts and interactions between different parts of the system. Sticky notes continue to be utilized to represent commands, events, and policies at this level. These notes provide a detailed view of the internal workings of components, illustrating how they respond to commands, emit events, and enforce policies. The Design Level is crucial for defining the behavior and logic within each component, ensuring that the system functions as intended and aligns with business objectives.

Identifying Aggregates

Event Storming intricately intertwines with the principles and vocabulary of Domain-Driven Design (DDD) to model and elucidate technical concepts. Here we reach a crucial and often challenging aspect of DDD: understanding and identifying aggregates. Aggregates, despite being a fundamental part of DDD, are commonly one of the least understood concepts among engineers. This lack of clarity can lead to significant pitfalls in both system design and implementation. Aggregates are more than just collections of objects; they represent carefully crafted boundaries around a set of entities and value objects. These boundaries are crucial for maintaining data integrity and encapsulating business rules. However, engineers often struggle with understanding the optimal size and scope of an aggregate, leading to either overly large aggregates that become bottlenecks or too many small aggregates that make the system unnecessarily complex. I recommend reading my separate article dedicated to understanding aggregates in DDD, which lays the foundation for the concepts we'll explore here. In the intricate process of Design Level Event Storming, especially when identifying and defining aggregates for a complex system like a campervan rental service, the foremost step is to ensure the involvement of the right mix of people, and to adjust that mix as needed. This team should ideally be a blend of domain experts, who bring in-depth knowledge of the campervan rental business, and technical specialists, such as software developers and architects. Their combined insights are crucial in ensuring that the identified aggregates align with both business realities and technical feasibility. Additionally, including individuals with a focus on user experience is invaluable, particularly for aspects of the system that directly interact with customers. Once this diverse and knowledgeable team is assembled, a pivotal initial step is to revisit and reflect upon the insights gained from the Process Level.
This stage is crucial as it provides a rich tapestry of information about the business workflows, key events, commands, and the intricate policies that were identified and explored previously. It's at this juncture that a deep understanding of how the business operates comes to the forefront, offering a nuanced perspective that is essential for the next phase of aggregate identification and design. In Event Storming, the flow often goes from a command (an action initiated) to a domain event (a significant change or result in the system). However, there's usually an underlying business rule that dictates how and why this transition from command to event happens. This is where blank yellow sticky notes come in. The blank yellow sticky note serves as a placeholder for the business rule that connects the command to the domain event. It represents the decision-making logic or criteria that must be satisfied for the event to occur as a result of the command. When a command and its corresponding domain event are identified, a blank yellow sticky note is placed between them. This signifies that there is a business rule at play, influencing the transition from the command to the event. The blank state of the sticky note invites team members, especially domain experts, to discuss and identify the specific rule or logic. This is a collaborative process where different perspectives help in accurately defining the rule. Through discussion, the team arrives at a clear understanding of the business rules. Participants are asked to fill in these business rules on the yellow sticky notes with comprehensive details about their execution. This involves several key aspects:

Preconditions: What must be true before the rule is executed? For instance, before the Rent Campervan command can succeed, a precondition might be that the selected campervan must be available for the chosen dates.
Postconditions: What becomes true after the rule is executed? Following the campervan rental, a postcondition would be that the campervan's status changes to "rented" for the specified period.
Invariants: What remains true throughout the execution of the rule? An invariant could be that a customer's account must be in good standing throughout the rental process.
Additional information: Any other clarifications or details that help in understanding what the business rule does.

Some business rules might be straightforward, but others could lead to extensive discussions. This interaction is a crucial part of the knowledge-sharing process. It allows domain experts to clarify complex business logic and developers to understand how these rules translate into system functionality. These discussions are invaluable for ensuring that everyone has a clear and shared understanding of how the business operates and how the system should support these operations. This process goes hand in hand with an in-depth analysis of the various events and commands that had emerged in earlier stages. We noticed a distinct pattern: a cluster of activities and decisions consistently revolved around the management of campervans. The technique involves physically moving these similar business rules on top of one another on the board where the Event Storming session is visualized. This action is more than just an organizational step; it's a method to highlight and analyze the interrelations and potential redundancies among the rules. This consolidation helps in clearly seeing how different rules interact with the same set of data or conditions.
It can reveal dependencies or conflicts between rules that might not have been evident when they were considered in isolation. By grouping similar rules, you simplify the overall complexity of the system. It becomes easier to understand and manage the business logic when it's seen through grouped, related rules rather than as a multitude of individual ones. This process can also uncover opportunities to refine or merge rules, leading to more streamlined and efficient business processes. Moreover, a closer look at the operational challenges and data cohesion associated with campervans solidified our thinking. We realized that managing various aspects related to campervans under a unified system would streamline operations, reducing complexities and enhancing service efficiency. The disparate pieces of information - maintenance schedules, booking calendars, location tracking - all pointed towards a need for an integrated approach. The decision to establish an aggregate was a culmination of these observations and discussions. It was a decision driven not just by operational logic but by the natural convergence of business activities related to campervans. By forming this aggregate, we envisioned a system where all aspects of a campervan's lifecycle were managed cohesively, ensuring seamless operations and an enhanced customer experience. This approach also brought into focus the need for enforcing consistency across the campervan fleet. By designing an aggregate to encapsulate all aspects related to each vehicle, we ensured that any changes - be it a rental status update or a maintenance check - were consistently reflected across the entire system. This consistency is crucial for maintaining the integrity and reliability of our service. A campervan, for instance, should not be available for booking if it's scheduled for maintenance. Similarly, the location information of each campervan needs to be accurate and up-to-date to ensure efficient fleet management. Imagine a scenario where a customer books a campervan for a journey from Munich to Paris. Within the aggregate, several pieces of information about each campervan are tracked and managed, including its current location, availability status, and maintenance schedule. When the customer selects a specific campervan for their dates, the aggregate immediately updates the vehicle's status to "rented" for that period. This update is critical to ensure that the same campervan isn't available for other customers for the same dates, preventing double bookings. Simultaneously, let's say this chosen campervan is due for maintenance soon after the customer's proposed return date. The system, adhering to the rules within the aggregate, flags this campervan for maintenance, ensuring that it does not get rented again before the maintenance is completed. Invariants, or rules that must always hold true, became a cornerstone in the design of the aggregate. These invariants enforce critical business rules and ensure the validity of the system at all times. For example, an invariant in our system ensures that a campervan cannot be simultaneously booked and marked as under maintenance. Such invariants are essential for maintaining data integrity and providing a consistent, reliable service to our customers. Let's consider a real-life scenario to illustrate this: A family eagerly plans a summer road trip from Munich to the picturesque landscapes of Paris. They find the perfect campervan on our website and proceed to rent it for their adventure.
Invisible to them, but crucial for their journey, is the invariant at work. As soon as they select their dates and campervan, the system springs into action. It checks the aggregate, probing for two critical conditions: is the chosen campervan already rented for these dates, and is it due for maintenance? This is where the invariant exerts its influence. It ensures that this campervan is neither engaged in another journey nor scheduled for a maintenance check during the requested time. This rule is inflexible, a cornerstone of our commitment to reliability.

These invariants, embedded within our aggregate, are more than lines of code or business policies. They are a promise: a promise of adventure without the unexpected, of journeys that create memories, not mishaps. By ensuring that each campervan is properly prepared and genuinely available for every booking, these rules not only keep our operations smooth but also cement our reputation as a reliable and customer-centric business.

In our exploration of the campervan rental business through Event Storming, we've identified a multitude of individual events, commands, and policies. However, these elements gain true significance only when they are clustered together as a cohesive unit. This clustering is what forms the essence of the aggregate. It's the conceptual realization that the isolated pieces - rentals, maintenance schedules, customer interactions - are interdependent and collectively form the core of our service. Without this unification, each element would merely exist in a vacuum, lacking context and impact.

The heart of this aggregate, its root, is the campervan itself. The campervan is not just a physical entity but a nexus of business processes and customer experiences. We selected the campervan as the aggregate root because it is the central element around which all other activities revolve. Whether it's booking a campervan, preparing it for a customer, or scheduling its maintenance, every action directly relates to an individual campervan. This choice reflects our understanding that the campervan is the linchpin of our business model, directly influencing customer satisfaction and operational efficiency.

The Naming of Aggregates

Alberto Brandolini emphasized the "blank naming" strategy: in the initial phase, rather than rushing to assign specific names or predefined functions to aggregates, the team is encouraged to recognize them in a more open-ended manner. The naming step, strategically placed at the end of the session, is more than a labeling exercise; it is the final act of distilling and synthesizing the insights gained throughout the event. Early in the session, it might be tempting to assign names to these aggregates. That urge is resisted because premature naming can lead to misconceptions: names given too early might not fully capture the essence of what an aggregate represents, as they are based on an initial, incomplete understanding of the system. Waiting until the end to name the aggregates is therefore not just a procedural step; it is a deliberate choice to ensure accuracy and relevance.

Thus, the Campervan Aggregate, with the campervan as its root, becomes a powerful tool in our system architecture, encapsulating the complexity of our operations into a manageable and coherent structure.
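As a rough illustration of how such an aggregate might enforce its invariants in code, here is a minimal Python sketch. It assumes that rentals and maintenance windows are simple date ranges; the class and method names are hypothetical and not taken from the article:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

Period = Tuple[date, date]


def _overlaps(a: Period, b: Period) -> bool:
    """True if two date ranges share at least one day."""
    return a[0] <= b[1] and b[0] <= a[1]


@dataclass
class Campervan:
    """Aggregate root: all rental and maintenance state for one vehicle."""
    campervan_id: str
    rentals: List[Period] = field(default_factory=list)
    maintenance: List[Period] = field(default_factory=list)

    def is_available(self, start: date, end: date) -> bool:
        requested = (start, end)
        busy = self.rentals + self.maintenance
        return not any(_overlaps(requested, period) for period in busy)

    def rent(self, start: date, end: date) -> None:
        # Invariant: a campervan can never be rented and under maintenance
        # (or double-booked) for overlapping dates.
        if not self.is_available(start, end):
            raise ValueError("campervan is rented or under maintenance for these dates")
        self.rentals.append((start, end))

    def schedule_maintenance(self, start: date, end: date) -> None:
        # The same invariant, enforced from the maintenance side.
        if not self.is_available(start, end):
            raise ValueError("campervan is rented or under maintenance for these dates")
        self.maintenance.append((start, end))


# Usage: the Munich-to-Paris booking from the scenario above (dates are made up).
van = Campervan("camper-42")
van.rent(date(2024, 7, 1), date(2024, 7, 14))
van.schedule_maintenance(date(2024, 7, 15), date(2024, 7, 16))  # fine: after the return date
# van.rent(date(2024, 7, 10), date(2024, 7, 20))  # would raise: overlaps the existing rental
```

Because every change goes through the aggregate root, the overlap check lives in exactly one place, which is what keeps the "never booked and under maintenance at the same time" rule from being violated by any individual command.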
Conclusion

As we conclude Part 3 of "Demystifying Event Storming: Design Level, Identifying Aggregates," we have navigated the intricate process of identifying aggregates, a pivotal aspect of Domain-Driven Design. This journey through the Design Level has illustrated the utility of Event Storming in mapping complex systems, highlighting the importance of a collaborative approach and the strategic use of the blank naming strategy. The emergence of the Campervan Aggregate in our rental business model is a testament to the effectiveness of this methodology. It underscores how well-defined aggregates can streamline system design and keep it aligned with business objectives. The decision to name these aggregates at the end of the session, based on deep insight and understanding, has been crucial in accurately reflecting their roles within the system.

Looking ahead, our series will continue to explore the depths of Event Storming. In the next installment, we will delve into identifying bounded contexts, a key concept in Domain-Driven Design that further refines our understanding of complex systems. That next phase will focus on how bounded contexts help delineate clear boundaries within the system, facilitating better organization and more efficient communication across its different parts.
In the dynamic world of online services, site reliability engineering (SRE) has risen as a pivotal discipline, ensuring that large-scale systems maintain their performance and reliability. Bridging the gap between development and operations, SRE is a set of principles and practices that aims to create scalable and highly reliable software systems.

Site Reliability Engineering in Today's World

Site reliability engineering is an engineering discipline devoted to maintaining and improving the reliability, durability, and performance of large-scale web services. Originating from the complex operational challenges faced by large internet companies, SRE incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goal is to create automated solutions for operational concerns such as on-call monitoring, performance tuning, incident response, and capacity planning. Further Reading: Top Open Source Projects for SREs.

What Does a Site Reliability Engineer Do?

A site reliability engineer operates at the intersection of software engineering and systems engineering. The role was a natural evolution for many database administrators with deeper system administration skills once the move to the cloud began. The role of the SRE encompasses:

- Developing software and writing code for service scalability and reliability
- Ensuring uptime, maintaining services, and minimizing downtime
- Incident management, including handling system outages and conducting post-mortems
- Optimizing on-call duties, balancing responsibilities with proactive engineering
- Capacity planning, which includes predicting future needs and scaling resources accordingly

Site Reliability Engineering Principles

The core principles of SRE form the foundation upon which its practices and culture are built. One of the key tenets is automation. SRE prioritizes automating repetitive and manual tasks, which not only minimizes the risk of human error but also frees engineers to focus on more strategic, high-value work. Automation in SRE extends beyond simple task execution; it encompasses self-healing systems that automatically recover from failures, predictive analytics for capacity planning, and dynamic provisioning of resources. This principle seeks to create a system where operational work is managed efficiently, leaving SRE professionals to concentrate on enhancements and innovations that drive the business forward.

Measurement is another cornerstone of SRE. In the spirit of the adage "You can't improve what you can't measure," SRE implements rigorous quantification of reliability and performance. This includes defining clear service level objectives (SLOs) and service level indicators (SLIs) that provide a detailed view of a system's health and user experience. By consistently measuring these metrics, SREs make data-driven decisions that align technical performance with business goals.

Shared ownership is integral to SRE as well. It dissolves the traditional barriers between development and operations, encouraging both teams to take collective responsibility for the software they build and maintain. This collaboration ensures a more holistic approach to problem-solving, with developers gaining more insight into operational issues and operations teams getting involved earlier in the development process.
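Returning to the measurement principle, the following Python sketch shows one common way an availability SLI can be computed from request counts and compared against an SLO target. The 99.9% target and the request numbers are made-up examples, not prescriptions:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that were 'good' (e.g., non-5xx responses)."""
    if total_events == 0:
        return 1.0
    return good_events / total_events


def slo_report(good_events: int, total_events: int, slo_target: float = 0.999) -> dict:
    """Compare the measured SLI against the SLO and report the remaining error budget."""
    sli = availability_sli(good_events, total_events)
    allowed_bad = total_events * (1 - slo_target)   # error budget, in events
    actual_bad = total_events - good_events
    return {
        "sli": round(sli, 5),
        "slo_target": slo_target,
        "slo_met": sli >= slo_target,
        "error_budget_events": int(allowed_bad),
        "error_budget_remaining": int(allowed_bad - actual_bad),
    }


# e.g., 2,000,000 requests this month, 1,400 of which failed
print(slo_report(good_events=1_998_600, total_events=2_000_000))
```

The output makes the data-driven trade-off explicit: as long as the remaining error budget is positive, the team has measurable room to ship changes; once it is exhausted, the numbers argue for prioritizing reliability work.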
Lastly, a blameless culture is crucial to the SRE ethos. By treating failures as opportunities for improvement rather than reasons for punishment, teams are encouraged to share information openly and without fear. This approach leads to a more resilient organization, as it promotes a DevOps culture of transparency and continuous learning. When incidents occur, blameless postmortems are conducted, focusing on what happened and how to prevent it in the future rather than on who caused it. This principle not only enhances the team's ability to respond to incidents but also contributes to a positive and productive work environment. Together, these principles guide SRE teams in creating and maintaining reliable, efficient, and continuously improving systems.

The Benefits of Site Reliability Engineering

Site reliability engineering not only improves system reliability and uptime but also bridges the gap between development and operations, leading to more efficient and resilient software delivery. By adopting SRE principles, organizations can strike a balance between innovation and stability, ensuring that their services are both cutting-edge and dependable for their users.

Benefits:
- Improved Reliability: Ensures systems are dependable and trustworthy.
- Efficiency: Automation reduces manual labor and speeds up processes.
- Scalability: Provides an essential framework for systems to grow without a decrease in performance.
- Innovation: Frees up engineering time for feature development.

Drawbacks:
- Complexity: Can be difficult to implement in established systems without proper expertise.
- Resource Intensive: Initially requires significant investment in training and tooling.
- Balancing Act: Striking the right balance between new features and reliability can be challenging.

Site Reliability Engineering vs. DevOps

Site reliability engineering and DevOps are two methodologies that, while converging on the shared aim of streamlining software development and enhancing system reliability, take distinct paths to get there. DevOps is primarily focused on melding the development and operations disciplines to accelerate the software development lifecycle. This is achieved through continuous integration and continuous delivery (CI/CD), which ensure that code changes are automatically built, tested, and prepared for release to production. The heart of DevOps lies in its cultural underpinnings: breaking down silos, fostering cross-functional collaboration, and promoting shared responsibility for the software's performance and health. Learn the Difference: DevOps vs. SRE vs. Platform Engineer vs. Cloud Engineer.

SRE, in contrast, takes a more structured approach to reliability, providing concrete strategies and a framework for maintaining robust systems at scale. It applies software engineering principles to operational problems, which is why an SRE team's work often includes writing code for system automation, crafting error budgets, and establishing service level objectives (SLOs). While it shares the collaborative spirit of DevOps, SRE focuses specifically on ensuring system reliability and stability, especially in large-scale operations. It operationalizes DevOps by adding a set of practices oriented toward proactive problem prevention and quick problem resolution, ensuring that the system not only works well under normal conditions but also maintains performance during unexpected surges or failures.
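Since error budgets come up repeatedly in this comparison, here is a small sketch of the arithmetic behind them: translating an availability SLO into an allowed-downtime budget over a rolling window and measuring how much of that budget an outage consumes. The 30-day window and 99.9% target are illustrative assumptions:

```python
from datetime import timedelta


def downtime_budget(slo_target: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """How much total downtime an availability SLO permits over the window."""
    return window * (1 - slo_target)


def budget_consumed(observed_downtime: timedelta, slo_target: float,
                    window: timedelta = timedelta(days=30)) -> float:
    """Fraction of the error budget already spent (values above 1.0 mean the SLO is blown)."""
    budget = downtime_budget(slo_target, window)
    return observed_downtime / budget


# A 99.9% monthly SLO allows roughly 43 minutes of downtime ...
print(downtime_budget(0.999))                                   # 0:43:12
# ... so a single 20-minute outage consumes about 46% of that budget.
print(round(budget_consumed(timedelta(minutes=20), 0.999), 2))  # 0.46
```

Framed this way, the error budget becomes a shared, quantified contract between feature velocity and reliability rather than a matter of opinion.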
Monitoring, Observability, and SRE

Monitoring and observability form the foundational pillars of site reliability engineering. Monitoring is the systematic process of gathering, processing, and interpreting data to gain a comprehensive view of a system's current health. It relies on metrics and logs to track the performance and behavior of the system's components. The primary goal of monitoring is to detect anomalies and performance deviations that may indicate underlying issues, allowing for timely intervention.

Observability, on the other hand, extends beyond the scope of monitoring by providing insight into the system's internal workings through its external outputs. It is the ability to infer the internal state of the system from data such as logs, metrics, and traces, without needing to add new code or additional instrumentation. SRE teams leverage observability to understand complex system behaviors, enabling them to identify potential issues and address them proactively. By integrating these practices, SRE ensures that the system not only remains reliable but also meets its business objectives, delivering a seamless user experience.

Conclusion

Site reliability engineering is essential for businesses that depend on providing reliable online services. With its blend of software engineering and systems management, SRE helps ensure that systems are not just functional but also resilient, scalable, and efficient. As organizations increasingly rely on complex systems to conduct their operations, the principles and practices of SRE will become ever more integral to their success. This analysis has touched on the multifaceted role of SRE in modern web services, its core principles, and the tangible benefits it brings. Understanding the distinction between SRE and DevOps clarifies its unique position in the technology landscape, highlighting how essential the discipline is to achieving and maintaining high standards of reliability and performance in today's digital world.
Stefan Wolpers
Agile Coach,
Berlin Product People GmbH
Daniel Stori
Software Development Manager,
AWS
Alireza Rahmani Khalili
Officially Certified Senior Software Engineer, Domain Driven Design Practitioner,
Worksome