Data is at the core of software development. Think of it as information stored in anything from text documents and images to entire software programs, and these bits of information need to be processed, read, analyzed, stored, and transported throughout systems. In this Zone, you'll find resources covering the tools and strategies you need to handle data properly.
Multi-Touch Attribution (MTA) is an advanced approach in digital marketing analytics that assigns credit to each touchpoint a consumer interacts with during their journey toward a conversion. Unlike traditional models that attribute conversion success to a single touchpoint, MTA recognizes the complexity of consumer behavior by analyzing how different channels and interactions contribute to the final outcome. This method is increasingly crucial in a multi-channel marketing landscape because it provides more accurate insights into the effectiveness of various marketing strategies and campaigns. On the technical side, MTA employs algorithms and statistical methods to distribute credit for a conversion across multiple customer interactions, from first exposure to the final conversion action. In this article, we are going to explore some of the traditional and advanced multi-touch attribution models and the algorithms behind them.

Figure 1: Attribution model assigns weights to each channel

Traditional Models

Assume we have a series of n touchpoints leading to a conversion, and let C represent the total conversion value. The contribution value assigned to each touchpoint i is denoted Vi. The traditional attribution models are:

Last-click attribution assigns full credit for a conversion to the final touchpoint before a purchase. While straightforward, its major flaw is the disregard for all preceding customer interactions, potentially undervaluing the importance of early engagement and awareness initiatives in the marketing funnel.

First-click attribution credits the initial interaction in the customer's journey with the entire conversion. This approach overlooks the contribution of subsequent touchpoints, often resulting in a skewed understanding of the effectiveness of mid-funnel and closing strategies.

Linear attribution evenly distributes credit across all touchpoints. This model's critical limitation is its failure to acknowledge the varying influence of different interactions, potentially oversimplifying the impact of each marketing effort.

Position-based (U-shaped) attribution emphasizes the first and last interactions (usually 40% credit each), with the rest spread evenly across the middle touchpoints. This model might not accurately capture the significance of mid-funnel activities and can oversimplify complex customer journeys.

W-shaped attribution is an extension of the U-shaped model: it also gives additional weight to a mid-funnel touchpoint (typically a lead conversion), along with the first and last touchpoints.

These traditional attribution models, while providing basic frameworks for understanding marketing impact, often fall short of accurately reflecting the intricate, multi-faceted nature of modern consumer journeys. They tend to either oversimplify the process or bias certain touchpoints, leading to potentially skewed marketing insights and decisions. As the digital landscape evolves, more sophisticated and nuanced approaches like Multi-Touch Attribution are gaining prominence to address these limitations.
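To make these rule-based schemes concrete, here is a minimal Python sketch. The journey, channel names, conversion value, and the 40/20/40 split are illustrative assumptions, not data from the article.

```python
# Illustrative only: journey, channel names, conversion value, and split are assumed.
journey = ["Display", "Email", "Search Ads", "Social"]   # touchpoints in order
C = 100.0                                                 # total conversion value

def last_click(tps, c):
    return {t: (c if i == len(tps) - 1 else 0.0) for i, t in enumerate(tps)}

def first_click(tps, c):
    return {t: (c if i == 0 else 0.0) for i, t in enumerate(tps)}

def linear(tps, c):
    return {t: c / len(tps) for t in tps}

def u_shaped(tps, c, first=0.4, last=0.4):
    # 40% each to the first and last touchpoints, the rest split over the middle
    credit = {t: 0.0 for t in tps}
    credit[tps[0]] += first * c
    credit[tps[-1]] += last * c
    middle = tps[1:-1]
    for t in middle:
        credit[t] += (1 - first - last) * c / len(middle)
    return credit

for model in (last_click, first_click, linear, u_shaped):
    print(model.__name__, model(journey, C))
```

Running all four rules on the same journey makes the bias of each model easy to compare side by side.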
Advanced Models

Time Decay Attribution Model

The Time Decay Attribution Model is a popular method used in marketing analytics to attribute credit for conversions based on the timing of customer touchpoints. It operates on the principle that touchpoints closer in time to the conversion are more influential than earlier ones.

Concept

The Time Decay model assigns more credit to marketing interactions that occur closer to the time of conversion, based on the rationale that these later interactions are likely more impactful in influencing the customer's final decision. It is particularly useful in long sales cycles where multiple touchpoints occur over an extended period, allowing marketers to weigh recent interactions more heavily.

Approach

Touchpoint identification: All touchpoints along the customer journey, from the first interaction to the conversion, are identified.
Time-based weighting: Each touchpoint is assigned a weight that increases as it gets closer to the conversion event. The weighting typically follows an exponential or logarithmic function, so the credit allocation accelerates as the touchpoint gets closer to the moment of conversion.
Credit allocation: The model distributes the total conversion value among the touchpoints based on their assigned weights.

A common way to represent the Time Decay model is through an exponential decay function. If t represents the time of a touchpoint and T is the time of conversion, the weight W assigned to the touchpoint can be expressed as:

W(t) = e^(−λ(T − t))

Where:
e is the base of the natural logarithm.
λ is a decay rate constant that determines how rapidly the weight of a touchpoint decreases over time. A higher λ means a faster decay.
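As a quick illustration of the exponential weighting, the snippet below computes W(t) = e^(−λ(T − t)) for a handful of hypothetical touchpoints and normalizes the weights so the credited amounts sum to the conversion value. The touchpoint ages, λ, and conversion value are assumptions for the example.

```python
import math

# Hypothetical ages of each touchpoint, in days before the conversion (T - t)
touchpoint_age = {"Display": 10, "Email": 6, "Search Ads": 2, "Social": 0}
C = 100.0       # total conversion value
lam = 0.3       # decay rate λ; a higher value discounts older touchpoints faster

# W(t) = e^(-λ(T - t)): touchpoints closer to the conversion get larger weights
weights = {ch: math.exp(-lam * age) for ch, age in touchpoint_age.items()}
total = sum(weights.values())

# Normalize the weights so the credited amounts add up to the conversion value
credit = {ch: C * w / total for ch, w in weights.items()}
for ch, value in credit.items():
    print(f"{ch}: {value:.1f}")
```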
Markov Chain Attribution Models

The Markov Chain Model in MTA is a sophisticated method used to evaluate the effectiveness of different marketing touchpoints in a customer's journey. It treats the customer journey as a sequence of states corresponding to the various touchpoints. The key property of a Markov Chain is that the probability of moving to the next state depends only on the current state, not on the previous states. Each touchpoint in a customer's journey is a state in the Markov Chain, and the model analyzes transitions between these states to understand how customers move through the sales funnel and how each touchpoint influences their journey toward conversion.

Background

Markov Chains were developed by Andrey Markov in the early 20th century. Their adoption in marketing attribution is a relatively recent innovation, leveraging their capacity to model complex, non-linear customer journeys. The use of Markov Chains in marketing attribution became prominent with the rise of multi-channel digital marketing strategies. In an environment where customers interact with multiple marketing touchpoints across different channels before converting, traditional attribution models like last-click or first-click became insufficient, and Markov Chain models offered a more dynamic and holistic view.

Algorithmic Approach

Defining states: Each unique touchpoint, along with the start, conversion, and non-conversion, is defined as a state.
Transition probability matrix: Construct a matrix that represents the probabilities of transitioning from one state (touchpoint) to another, based on historical data.
Building the chain: Use the transition probabilities to model the customer journey as a Markov Chain.
Calculating conversion probabilities: Compute the likelihood of reaching the conversion state from each touchpoint.
Assessing touchpoint influence: Analyze the impact of removing individual touchpoints on the overall conversion probability, indicating their contribution to the journey.

The table below shows the journeys of four customers and whether each converted.

| Customer | Channel | Conversion |
|---|---|---|
| A | Email -> House Ads | No |
| B | Search Ads -> House Ads | Yes |
| C | House Ads | No |
| D | Search Ads -> Social | Yes |

The customer journeys in the table above can be visualized as a Directed Acyclic Graph (DAG) with a probability on each transition.

Figure 2: Customer Journey DAG

Removal Effect

An important aspect of Markov chain attribution is how the removal of a given touchpoint from the graph affects the likelihood of conversion. Let's remove the Email node from the graph above to understand this behavior.

Figure 3: DAG with Email node removed

By removing Email, the conversion probability drops from 50% to 41.67%. The removal effect of a channel can be calculated as:

Removal Effect(channel) = 1 − P(conversion with the channel removed) / P(conversion with all channels)

Based on this formula, the removal effect of Email is:

Removal Effect(Email) = 1 − 0.4167 / 0.50 = 0.1667 (16.67%)

Similarly, the removal effect of the other channels can be calculated with the same formula, and each channel's share is its removal effect divided by the sum of all removal effects:

| Channel | Conversion Probability (channel removed) | Removal Effect | Share |
|---|---|---|---|
| House Ads | 25% | 50% | 0.273 |
| Search Ads | 16.67% | 66.67% | 0.364 |
| Social | 25% | 50% | 0.273 |
| Email | 41.67% | 16.67% | 0.090 |
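The numbers in the table above can be reproduced directly from the four journeys. The sketch below estimates first-order transition probabilities, walks the resulting (acyclic) graph to get the conversion probability, and recomputes the removal effect and share for each channel; the helper names are illustrative.

```python
from collections import defaultdict

# The four journeys from the table above; True means the journey converted
journeys = [
    (["Email", "House Ads"], False),
    (["Search Ads", "House Ads"], True),
    (["House Ads"], False),
    (["Search Ads", "Social"], True),
]

def transition_probs(journeys):
    """First-order (Markov) transition probabilities estimated from the journeys."""
    counts = defaultdict(lambda: defaultdict(int))
    for path, converted in journeys:
        states = ["Start"] + path + ["Conversion" if converted else "Null"]
        for a, b in zip(states, states[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(outs.values()) for b, n in outs.items()}
            for a, outs in counts.items()}

def conversion_probability(probs, removed=None):
    """Probability of reaching 'Conversion' from 'Start' in this acyclic graph.
    A removed channel absorbs any traffic routed to it (the removal effect)."""
    cache = {}
    def p(state):
        if state == "Conversion":
            return 1.0
        if state in ("Null", removed) or state not in probs:
            return 0.0
        if state not in cache:
            cache[state] = sum(w * p(nxt) for nxt, w in probs[state].items())
        return cache[state]
    return p("Start")

probs = transition_probs(journeys)
base = conversion_probability(probs)                       # 0.50
removal = {c: 1 - conversion_probability(probs, removed=c) / base
           for c in ["House Ads", "Search Ads", "Social", "Email"]}
total_effect = sum(removal.values())
shares = {c: effect / total_effect for c, effect in removal.items()}
print(base, removal, shares)   # shares ≈ 0.273, 0.364, 0.273, 0.091
```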
Markov Chain Limitations

Markov Chains assume that the next state (or touchpoint) depends only on the current state and not on the history of states. This assumption might not accurately represent marketing scenarios where the effect of a touchpoint could depend on previous interactions. The model also often overlooks the influence of one channel on the effectiveness of another, potentially underestimating the synergistic or suppressive effects between different marketing channels. Finally, Markov Chain models require comprehensive and granular data on customer interactions across all channels and touchpoints.

Shapley Value Attribution Model

The Shapley Value Model, originating from cooperative game theory, offers a unique and equitable approach to MTA in marketing. It allocates credit to each touchpoint in a way that fairly represents its contribution to the overall success of a marketing campaign. Its goal is a fair attribution method that considers all possible combinations of touchpoints, ensuring each one gets credit proportional to its impact.

Concept and Background

Developed by Lloyd Shapley in 1953, the Shapley Value is a solution concept in cooperative game theory. It is designed to fairly distribute the payoff among players who cooperate and contribute differently to the coalition. In a cooperative game where multiple players join forces to create coalitions, thereby increasing the chances of a successful outcome (or payoff), the Shapley value offers a method for equitably distributing the payoff among the participants. At its core, the Shapley value calculates the average contribution of each player to the coalitions they participate in. This calculation takes into account the variability in the influence (or worth) each player brings and the order in which they join the coalitions, considering that every sequence of joining has an equal chance of occurring. Players are therefore compensated based on their contribution across all possible permutations.

When applied to marketing analytics, the players are the various campaign channels, and the coalitions represent the different ways these channels interact and engage with accounts throughout the customer's journey. Using cooperative game theory and the Shapley value, we can achieve a stable and fair measure of each channel's influence, allocating credit for sales conversions among the channels in proportion to their individual contributions to the overall outcome.

Algorithmic Approach

Enumerate all possible coalitions: List all possible combinations (subsets) of touchpoints that might lead to a conversion.
Calculate the payoff of each coalition: Determine the value (e.g., conversion rate) that each subset of touchpoints achieves.
Distribute value among touchpoints: For each touchpoint, calculate its contribution across all possible coalitions it is part of, based on the difference it makes to each coalition's value.

The Shapley Value for a touchpoint is calculated using the formula:

ϕi(v) = Σ over all S ⊆ N \ {i} of [ |S|! (|N| − |S| − 1)! / |N|! ] × [ v(S ∪ {i}) − v(S) ]

Where:
ϕi(v) is the Shapley Value for touchpoint i.
N is the set of all touchpoints.
S is a subset of touchpoints excluding i.
v(S) is the payoff (value) of the subset S.
The sum is taken over all subsets S of N that don't include i.

Let's take an example with three touchpoints involved in conversions: N = { Search Ads, Social, Email }. The following table shows the ratio of conversions attributed to each coalition:

| Coalition | Channels | Ratio |
|---|---|---|
| S1 | Email | 0.04 |
| S2 | Search Ads | 0.12 |
| S3 | Social | 0.08 |
| S4 | Email + Search Ads | 0.17 |
| S5 | Social + Search Ads | 0.22 |
| S6 | Email + Social | 0.11 |
| S7 | Email + Search Ads + Social | 0.26 |

The payoff or worth of each coalition is determined by the characteristic function. In this example, the worth of a coalition is the sum of the conversion ratios of all of its sub-coalitions. For example, the payoff of coalition S5 is:

v(S5) = Ratio(S2) + Ratio(S3) + Ratio(S5) = 0.12 + 0.08 + 0.22 = 0.42

So the payoff of each coalition can be calculated as shown below:

| Function | Channels | Calculation | Payoff |
|---|---|---|---|
| v(S1) | Email | S1 | 0.04 |
| v(S2) | Search Ads | S2 | 0.12 |
| v(S3) | Social | S3 | 0.08 |
| v(S4) | Email + Search Ads | S1 + S2 + S4 | 0.33 |
| v(S5) | Social + Search Ads | S2 + S3 + S5 | 0.42 |
| v(S6) | Email + Social | S1 + S3 + S6 | 0.23 |
| v(S7) | Email + Search Ads + Social | S1 + S2 + S3 + S4 + S5 + S6 + S7 | 1.0 |
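As a small sketch of this characteristic function, the snippet below stores the conversion ratios from the table and sums the ratios of all sub-coalitions to obtain v(S); the function and variable names are illustrative.

```python
from itertools import combinations

channels = ["Email", "Search Ads", "Social"]

# Conversion ratio of each exact coalition (S1..S7 from the table above)
ratio = {
    frozenset(["Email"]): 0.04,
    frozenset(["Search Ads"]): 0.12,
    frozenset(["Social"]): 0.08,
    frozenset(["Email", "Search Ads"]): 0.17,
    frozenset(["Social", "Search Ads"]): 0.22,
    frozenset(["Email", "Social"]): 0.11,
    frozenset(["Email", "Search Ads", "Social"]): 0.26,
}

def worth(coalition):
    """Characteristic function v(S): sum of the ratios of every sub-coalition of S."""
    s = frozenset(coalition)
    return sum(r for subset, r in ratio.items() if subset <= s)

for size in range(1, len(channels) + 1):
    for coalition in combinations(channels, size):
        print(sorted(coalition), round(worth(coalition), 2))
# e.g. v({Email, Search Ads}) = 0.33, v({Search Ads, Social}) = 0.42, v(all) = 1.0
```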
Understanding the value contributed by each coalition allows for the calculation of Shapley values. These are determined by averaging the incremental contribution (marginal contribution) of each channel across all possible sequences of coalition formation. Essentially, the Shapley value method offers a systematic approach to apportion the total value generated by the grand coalition (the collective payoff) among the three channels, ensuring a fair distribution based on the unique contribution each channel makes to the overall outcome. The motivation behind the formulation of Shapley Values lies in accounting for the specific timing at which each channel or touchpoint joins a coalition, because this timing affects the player's marginal contribution. If the channel or touchpoint comes first in the order, its individual payoff is its marginal contribution; if it comes later, its marginal contribution is the payoff of the coalition containing it and all prior touchpoints in the sequence, minus the payoff of that coalition without it.

The Shapley value is the average expected marginal contribution of one channel or touchpoint after all possible combinations have been considered. In this scenario, that means simulating every possible order in which the touchpoints (Email, Social, and Search Ads) could engage with the customer. For each of these sequences, you assess the additional value (marginal payoff) brought by each touchpoint when it is added to the sequence. Then, by averaging these incremental values across all sequences, you obtain the Shapley Value for each touchpoint. This method ensures a fair and comprehensive evaluation of each touchpoint's contribution by considering every possible way they could interact in the customer's journey, thereby reflecting their true value in the grand scheme of the marketing strategy.

Let's consider the grand coalition S7 and use the Shapley value to distribute its payoff to each channel based on the arrival order of each channel.

| Arrival Order | Email Marginal Contribution | Social Marginal Contribution | Search Ads Marginal Contribution |
|---|---|---|---|
| Email + Social + Search | v(S1) = 0.04 | v(S6) − v(S1) = 0.19 | v(S7) − v(S6) = 0.77 |
| Email + Search + Social | v(S1) = 0.04 | v(S7) − v(S4) = 0.67 | v(S4) − v(S1) = 0.29 |
| Social + Email + Search | v(S6) − v(S3) = 0.15 | v(S3) = 0.08 | v(S7) − v(S6) = 0.77 |
| Social + Search + Email | v(S7) − v(S5) = 0.58 | v(S3) = 0.08 | v(S5) − v(S3) = 0.34 |
| Search + Email + Social | v(S4) − v(S2) = 0.21 | v(S7) − v(S4) = 0.67 | v(S2) = 0.12 |
| Search + Social + Email | v(S7) − v(S5) = 0.58 | v(S5) − v(S2) = 0.30 | v(S2) = 0.12 |
| Shapley Value or Average Marginal Contribution | 0.267 | 0.332 | 0.402 |
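The same averaging over arrival orders can be written in a few lines. The sketch below takes the payoffs v(S) from the table above and averages each channel's marginal contribution over all permutations, reproducing the values in the last row of the table.

```python
from itertools import permutations

# Payoffs v(S) from the payoff table above
v = {
    frozenset(): 0.0,
    frozenset(["Email"]): 0.04,
    frozenset(["Search Ads"]): 0.12,
    frozenset(["Social"]): 0.08,
    frozenset(["Email", "Search Ads"]): 0.33,
    frozenset(["Social", "Search Ads"]): 0.42,
    frozenset(["Email", "Social"]): 0.23,
    frozenset(["Email", "Search Ads", "Social"]): 1.0,
}
channels = ["Email", "Social", "Search Ads"]

shapley = {c: 0.0 for c in channels}
orders = list(permutations(channels))
for order in orders:
    arrived = set()
    for ch in order:
        # Marginal contribution of ch given the channels that arrived before it
        marginal = v[frozenset(arrived | {ch})] - v[frozenset(arrived)]
        shapley[ch] += marginal / len(orders)
        arrived.add(ch)

print({c: round(x, 3) for c, x in shapley.items()})
# {'Email': 0.267, 'Social': 0.332, 'Search Ads': 0.402}
```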
Shapley Value Limitations

Calculating the Shapley value can be computationally expensive, especially with a large number of players (or marketing channels): the model requires the evaluation of every possible combination of players, which grows exponentially with the number of players. When direct information about specific coalitions is missing, their values can be estimated from the available data through statistical modeling, machine learning techniques, or simpler heuristic methods.

Bayesian Probability Models

The Bayesian Attribution Model is an advanced approach within the realm of MTA that leverages Bayesian statistics to infer the impact of various marketing touchpoints on consumer behavior and conversion rates. This model is particularly notable for its ability to handle uncertainty and integrate prior knowledge into its analytical framework.

Concept and Functionality

Bayesian Attribution is rooted in Bayesian probability, which updates the probability estimate for a hypothesis as more evidence or information becomes available. This approach is particularly useful in situations where data is incomplete or uncertain. In the context of MTA, the Bayesian model assesses the probability of conversion given exposure to different marketing touchpoints. It updates these probabilities as new data becomes available, making it a dynamic and continuously evolving model.

Algorithmic Approach

Defining prior probabilities: Start with initial assumptions or "priors" about the effectiveness of different touchpoints. These priors can be based on historical data or expert opinion; if no prior data is available, a uniform probability distribution or other statistical methods can be used.
Collecting data: Gather data on customer interactions with various touchpoints along their journey toward a conversion.
Updating probabilities: As new data comes in, the model updates the probabilities using Bayes' Theorem, which combines prior probabilities with new evidence to produce updated (posterior) probabilities.
Continuous learning: The model keeps updating its understanding of touchpoint effectiveness as more interaction data is collected, refining its insights over time.

Bayesian Attribution uses Bayes' Theorem, which in its basic form is:

P(A|B) = P(B|A) × P(A) / P(B)

Where:
P(A|B) is the posterior probability (e.g., the probability of a conversion given exposure to a specific touchpoint).
P(B|A) is the likelihood (e.g., the likelihood of observing the data given the touchpoint's effectiveness).
P(A) is the prior probability (the initial assumption about the touchpoint's effectiveness).
P(B) is the marginal probability of the data.

Let's explore Bayesian Multi-Touch Attribution in a customer journey example involving Display Ads, Search Ads, Social Media, and Email Ads. Assume we have prior beliefs (based on historical data or expert opinions) about the effectiveness of each channel:

| Channel | Prior Probability |
|---|---|
| Display Ads (A) | 0.35 |
| Search Ads (B) | 0.30 |
| Social (C) | 0.25 |
| Email (D) | 0.10 |

Next, calculate the likelihood from the data. For example, if 70% of conversions involve Search Ads in the journey, then the likelihood of Search Ads is 0.7.

| Channel | Likelihood |
|---|---|
| Display Ads (A) | 0.75 |
| Search Ads (B) | 0.70 |
| Social (C) | 0.50 |
| Email (D) | 0.35 |

Marginal Probability = Σ (Likelihood of channel i × Prior Probability of channel i), taken over every channel.

Marginal Probability based on the data above = (0.75 × 0.35) + (0.7 × 0.3) + (0.5 × 0.25) + (0.35 × 0.1) = 0.633

Using Bayes' Theorem, the posterior probability of each channel can be calculated:

| Channel | Posterior Probability |
|---|---|
| Display Ads (A) | (0.75 × 0.35) / 0.633 = 0.42 |
| Search Ads (B) | (0.70 × 0.30) / 0.633 = 0.33 |
| Social (C) | (0.50 × 0.25) / 0.633 = 0.20 |
| Email (D) | (0.35 × 0.10) / 0.633 = 0.05 |
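The posterior table can be reproduced with a few lines of Python; the priors and likelihoods are the ones assumed in the example above.

```python
# Priors and likelihoods from the example above
priors = {"Display Ads": 0.35, "Search Ads": 0.30, "Social": 0.25, "Email": 0.10}
likelihood = {"Display Ads": 0.75, "Search Ads": 0.70, "Social": 0.50, "Email": 0.35}

# P(B): marginal probability of the observed data
marginal = sum(likelihood[ch] * priors[ch] for ch in priors)      # ≈ 0.633

# Bayes' Theorem: posterior = likelihood × prior / marginal
posterior = {ch: likelihood[ch] * priors[ch] / marginal for ch in priors}

for ch, p in posterior.items():
    print(f"{ch}: {p:.2f}")   # matches the posterior table above, up to rounding
```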
Limitations

The model's accuracy is partly dependent on the prior probabilities assigned to the effectiveness of different channels. These priors can be subjective and might skew the model if not set accurately.

Machine Learning Attribution Models

Machine learning attribution models include regression models, decision trees, random forests, and neural networks. These models analyze complex interactions among touchpoints, can handle various types of data (including unstructured data) and large datasets, and can find non-linear relationships. They also adapt as new data becomes available, providing continuously refined insights. Machine Learning (ML) algorithms have significantly advanced the field of MTA by introducing sophisticated methods to analyze complex customer journeys. These algorithms can decipher intricate patterns in large datasets, enabling marketers to understand and attribute the impact of various touchpoints more accurately.

Concept and Application

ML algorithms in MTA use data-driven approaches to model and predict the impact of each marketing touchpoint on the customer's path to conversion. They go beyond traditional rule-based attribution models by learning from data to identify how different touchpoints contribute to conversions. ML algorithms are employed to analyze the customer journey across multiple channels and touchpoints, and they can handle vast and varied datasets, accounting for non-linear relationships and interactions among touchpoints.

Types of Machine Learning Algorithms

1. Supervised learning: Supervised learning involves training a model on a labeled dataset, where the input (features) and the desired output (labels) are known. The model learns to map inputs to outputs. Common algorithms: regression models, decision trees, random forests, support vector machines, and neural networks.
2. Unsupervised learning: Unsupervised learning finds patterns or structures in a dataset without pre-existing labels. The algorithms discover inherent groupings or associations in the data. Examples: clustering algorithms like K-means, hierarchical clustering, and principal component analysis (PCA).

Conclusion

Multi-Touch Attribution (MTA) has emerged as a crucial tool in modern marketing analytics, offering a sophisticated way to understand and quantify the impact of various touchpoints in a customer's journey. By moving beyond the limitations of traditional single-touch attribution models, MTA provides a more nuanced and comprehensive view of the effectiveness of different marketing channels and strategies. Its ability to distribute credit for conversions more accurately across multiple interactions helps marketers optimize their campaigns, allocate budgets efficiently, and tailor customer experiences more effectively. However, the complexity and data-intensive nature of MTA models, along with the need for advanced analytical skills, mean that their implementation can be challenging. Despite these challenges, the insights gained from MTA are invaluable for businesses looking to navigate the complex, multi-channel landscape of modern digital marketing. As technology advances and data becomes more accessible, MTA is likely to become even more integral to effective marketing strategy development and evaluation.
People initially became interested in blockchain several years ago after learning about it as a decentralized digital ledger. It supports transparency because no one can change information stored on it once added. People can also watch transactions as they happen, further enhancing visibility. But how does blockchain support the integrity of cloud-stored data?

3 Ways Blockchain Supports the Integrity of Cloud-Stored Data

1. Protecting and Facilitating the Sharing of Medical Records

Technological advancements have undoubtedly improved the ease of sharing medical records between providers. When patients go to new healthcare facilities, all involved parties can easily see those individuals' histories, treatments, test results, and more. Such records keep everyone updated about what has happened to patients, which significantly reduces the likelihood of redundancies and confusion that could extend a health management timeline.

Cloud computing has also accelerated information-sharing efforts within healthcare and other industries. It allows medical professionals to access and collaborate through scalable platforms. Many healthcare workers also appreciate how they can access cloud apps from anywhere. That convenience supports physicians who must travel for continuing medical education events, travel nurses, surgeons who split their time between multiple hospitals, and others who often work from numerous locations.

However, despite these cloud computing benefits, a security-related downside is that these platforms use a centralized infrastructure to allow record sharing across users. That characteristic leaves cloud tools open to data breaches. In one case, researchers proposed addressing this shortcoming with a blockchain architecture to authenticate users and enable opportunities for sharing medical records securely. The group prioritized blockchain due to its immutability while seeking to create a system that allowed patients and their providers to share and store medical records securely. The researchers also wanted to design something that was not at risk of data loss or other failures.

The researchers implemented so-called "special recognition keys" to identify medical-related specifics, such as doctors, patients, and hospitals. When testing their system, the metrics studied included the time to complete a transaction and how well the communication-related attributes performed. The outcomes suggested the researchers' approach worked far better than existing solutions.

2. Improving Access Control

Data breaches can be costly, catastrophic events. Although there is no single solution for preventing them, people can make meaningful progress by focusing on access control. One of the most convenient things about the cloud is that it allows all authorized users to access content regardless of their location. However, as the number of people engaging with a cloud platform increases, so does the risk of compromised credentials that could allow hackers to enter networks and wreak havoc.

Many corporate leaders have prioritized cloud-first strategies. That approach can strengthen cybersecurity because service providers have numerous security features to supplement internal measures. Additionally, cloud-based backup capabilities facilitate faster data recovery if cyberattacks occur. However, research suggests some access control practices used by cloud administrators have significant shortcomings that could make cyberattacks more likely.
For example, one study about access management for cloud platforms found 49% of administrators store passwords in a spreadsheet. That is a huge security risk for many reasons, and it highlights the need for better password hygiene practices. Fortunately, blockchain is well-positioned to solve this problem.

In one example, researchers developed a blockchain system that uses attribute-based encryption technology to improve how cloud users access content. The setup also contains an audit contract that dynamically manages who can use the cloud and when. The team built a fine-grained and searchable system that maintained access control, strengthened cloud security, and achieved the desired results without excessive computing power. Results also showed the system increased storage capacity. When the group performed a security analysis on their blockchain creation, they found it resisted chosen-plaintext attacks and breaches based on guessed keywords. A theoretical examination and associated experiments suggested the tool worked better from a computing power and storage efficiency perspective than comparable alternatives.

3. Curbing Emerging Technologies' Potential Threats

Even as new technologies show tremendous progress and excite people about the future, some individuals specifically investigate how they could harm others through technological advancements. Developments associated with ChatGPT and other generative AI tools are excellent examples. Indeed, these chatbots can save people time by assisting them with tasks such as idea generation or outline creation. However, because these tools create believable-sounding paragraphs in seconds, some cybercriminals use generative artificial intelligence (genAI) chatbots to write phishing emails much faster than before. It's easy to imagine the ramifications of a cybercriminal who writes a convincing phishing message and uses it to access someone's cloud-stored information.

ChatGPT runs on a cloud platform built by OpenAI, which created the chatbot. A lesser-known issue affecting data integrity is that OpenAI uses interactions with the tool to train future versions of the algorithms. People can opt out of having their conversations become part of the training, but many haven't or don't know the process for doing it. As workers eagerly tested ChatGPT and similar tools, some committed potential security breaches without realizing it. Consider if a web developer enters a proprietary code string into ChatGPT and asks the tool for help debugging it. That seemingly minor decision could result in sensitive information becoming part of training data and no longer being carefully protected by the developer's employer.

Some leaders quickly established rules for appropriate usage or banned generative AI tools to address these threats. A February 2024 study also showed some workers kept entering sensitive information when using ChatGPT despite knowing the associated risks. It's still unclear how blockchain will support data integrity for people using cloud-based generative AI tools, but many professionals are upbeat about the potential.

Conclusion: Using Blockchain for Cloud Data Protection

Entities ranging from government agencies to e-commerce stores use cloud platforms daily. These options are incredibly convenient because they eliminate geographical barriers and allow people to use them through an active internet connection anywhere in the world.
However, many cloud tools store sensitive data, such as health records or payment details. Since cloud platforms hold such a wealth of information, hackers will likely continue targeting them. Although most cloud providers have built-in security features, cybercriminals continually seek ways to circumvent such protections. The examples here show why the blockchain is an excellent candidate for much-needed additional safeguards.
Data has become essential to our daily lives. People transmit data from one point to another for several reasons, including facilitating communication and sharing information between individuals and organizations, and two of the most common ways to transmit data are TCP and UDP. Data transmission enables point-to-point, point-to-multipoint, and multipoint-to-multipoint communication between devices. The TCP vs. UDP debate has been going on for a long time and will continue; which one you choose depends on what you want to achieve.

What Is TCP?

Transmission Control Protocol (TCP) is a reliable, connection-oriented protocol that you can use to transmit data over a network. It ensures that data is delivered error-free, in sequence, and without loss or duplication. TCP transfers your data over the internet from your device to a web server and works well for applications that require high reliability, such as email, file transfer, chatting with friends on Skype, watching online videos, and web browsing. Because it is connection-based, TCP establishes and maintains a connection between the receiver and sender while transferring data. TCP has become the most widely used transport protocol because it ensures data arrives completely intact.

How Does TCP Work?

TCP sends small packets of data through the internet that get reassembled when they arrive at the receiver's destination. In practice, TCP assigns each data packet a sequence number and unique identifier. These enable the receiver to identify which packet was received and which will arrive next. The receiver sends an acknowledgment to the sender on the sequential and orderly arrival of a data packet, which prompts the sender to send the next one. If the recipient does not acknowledge receiving a packet, it may indicate that the packet was lost or sent in the wrong sequence, and a resend is necessary. Sending data in sequence reduces congestion and enables flow control, making it easy to correct errors. The back-and-forth communication between the two parties takes extra time to establish a connection and exchange data; that extra time is what allows the data you send over TCP to reach the recipient safely.

What Is UDP?

UDP (User Datagram Protocol) is a connectionless protocol: it does not establish a prior connection between the two parties. It provides an unreliable but fast and more straightforward way of transferring data over a network, which is why it is favored where speed is crucial, such as in gaming and streaming. Each datagram carries the source and destination addresses and the data payload. A UDP user can experience packet loss due to network congestion, errors, latency, bandwidth issues, or other factors.

How Does UDP Work?

UDP can transmit the same data as TCP, but without the mandatory unique identifiers and sequence numbers. It sends data in a stream and has only a checksum to verify that the data arrived uncorrupted. UDP does not detect lost packets or fix errors; it is more error-prone, but it transmits data much faster than TCP.
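To see the difference in code, here is a minimal, self-contained Python sketch using the standard socket module: the TCP exchange needs a connection (and the data comes back acknowledged and in order), while UDP just fires a datagram at an address. The hosts, ports, and messages are made up for the example.

```python
import socket
import threading
import time

HOST, TCP_PORT, UDP_PORT = "127.0.0.1", 50007, 50008   # hypothetical local ports

def tcp_server():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((HOST, TCP_PORT))
        srv.listen(1)                         # TCP needs a connection before data flows
        conn, _addr = srv.accept()
        with conn:
            conn.sendall(conn.recv(1024))     # echo the data back to the client

def udp_server():
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as srv:
        srv.bind((HOST, UDP_PORT))
        data, addr = srv.recvfrom(1024)       # no connection: datagrams just arrive
        print("UDP server got:", data)

threading.Thread(target=tcp_server, daemon=True).start()
threading.Thread(target=udp_server, daemon=True).start()
time.sleep(0.2)                               # give the servers a moment to bind

# TCP client: handshake first, then ordered, acknowledged delivery over the connection
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as client:
    client.connect((HOST, TCP_PORT))          # three-way handshake happens here
    client.sendall(b"hello over tcp")
    print("TCP echo:", client.recv(1024))

# UDP client: fire-and-forget datagram, no handshake and no delivery guarantee
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as client:
    client.sendto(b"hello over udp", (HOST, UDP_PORT))
time.sleep(0.2)                               # let the UDP server print before exiting
```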
UDP's inability to correct errors, and the difficulty of setting up a firewall that allows some UDP communications while blocking the rest, do not mean that UDP connections are left entirely unprotected. While it is easier to secure TCP than UDP, UDP users can deploy a virtual private network (VPN) or proxies for particular apps, or enable a protective tunnel connecting the remote user and the organization's internal network.

What Are the Fundamental Differences Between TCP and UDP?

TCP and UDP are the most commonly used protocols for transmitting data over a network. The fundamental difference between them lies in how they establish a connection and ensure reliable data transfer. TCP is a connection-oriented protocol that establishes a reliable connection between two devices and guarantees the delivery of all data in the correct order without any loss or corruption. UDP, however, is a connectionless protocol that does not establish a connection and does not guarantee reliable data transfer. Instead, UDP simply sends a packet of data to the recipient without any acknowledgment or error-checking mechanism. As for TCP vs. UDP in relation to VPNs, OpenVPN works best on a UDP port, although it can be configured to run on any port. You can view a simple comparison of TCP and UDP below:

|  | UDP | TCP |
|---|---|---|
| Reliability | Lower | High |
| Speed | High | Lower |
| Transfer method | Delivers packets in a stream | Delivers packets in a sequence |
| Error detection and correction | No | Yes |
| Congestion control | No | Yes |
| Acknowledgment | Only the checksum | Yes |

Does SSH Use TCP or UDP?

SSH (Secure Shell) uses the TCP (Transmission Control Protocol) protocol. SSH provides secure, encrypted communication between two devices over a network; it uses TCP port 22 for communication and establishes a reliable connection between the client and the server.

Is HTTP TCP or UDP?

HTTP (Hypertext Transfer Protocol) is a TCP-based protocol. It uses TCP port 80 for communication and establishes a reliable connection between the client and the server to transfer hypertext.

Does DNS Work on Both TCP and UDP?

Yes, DNS (Domain Name System) can work over TCP or UDP, depending on the application's requirements. UDP is typically used for DNS queries because it is faster and more efficient, while TCP is used for zone transfers and other functions requiring more reliability and error correction.

Merits and Demerits of TCP

The advantages of using TCP include:
TCP operates independently of specific systems, enabling greater interoperability across systems and devices.
TCP provides error-free data transmission and ensures that the data it sends reaches its destination intact.
TCP adjusts the speed at which it transmits data to the capacity of the receiver (flow control).
TCP confirms that data has safely arrived at its destination and initiates a re-transfer if the first transmission fails.

As with most technologies, there is a flip side to every advantage; the disadvantages of TCP include:
TCP needs quite a lot of bandwidth, making it slower than UDP.
If any data is lost during transmission, such as part of an image or video, TCP holds back the remaining information until the loss is recovered.
TCP works poorly with local area networks (LAN) or personal area networks.

Advantages of UDP

The merits of using UDP include:
UDP considerably reduces end-to-end delay by sending smaller packets with less overhead.
UDP will deliver data even if some packets are missing, unlike TCP, where packet loss disrupts data transmission.
UDP has broadcast and multicast functionality that enables the simultaneous sending of a single transmission to multiple receivers.
UDP transmission doesn't require a lot of bandwidth, making it much faster and more efficient than other options, like TCP.

Setbacks of UDP data transmission include:
There is no way to check whether a data packet reached its destination successfully.
A UDP user cannot verify whether any packet was lost before arriving at the required destination during data transmission.
Where data transmission passes through routers, priority tends to go to TCP packets rather than UDP packets.
There is no orderliness in the arrival of packets because UDP doesn't send data sequentially.

Conclusion

The choice between TCP and UDP depends to a large extent on what you want to do. Both UDP and TCP divide your data into smaller units (data packets) that include the sender's and recipient's internet protocol addresses, the data you are transmitting, configuration information, and a trailer. For fast, constant data transmission you may go for UDP, but for reliability and ensuring you don't lose any data as you transmit, you may need to go for TCP. Static uses, such as emailing, web browsing, and file transfer, work well with OpenVPN via TCP, while UDP is a good option for gaming, streaming, or VoIP services.
Microsoft's SQL Server is a powerful RDBMS that is extensively utilized in diverse industries for data storage, retrieval, and analysis. The objective of this article is to help novices understand SQL Server from fundamental principles to advanced techniques, using real-world illustrations derived from nProbe data. nProbe is a well-known network traffic monitoring tool that offers comprehensive insights into network traffic patterns.

Getting Started With SQL Server

1. Introduction to SQL Server

SQL Server provides a comprehensive database management platform that integrates advanced analytics, robust security features, and extensive reporting capabilities. It offers support for a wide range of data types and functions, enabling efficient data management and analysis.

2. Installation

Begin by installing SQL Server. Microsoft offers different editions, including Express, Standard, and Enterprise, to cater to varying needs. The Express edition is free and suitable for learning and small applications. Here is the step-by-step guide to install SQL Server.

3. Basic SQL Operations

Learn the fundamentals of SQL, including creating databases, creating tables, and writing basic queries.

Create database: `CREATE DATABASE TrafficData;`

Create table: Define a table structure to store nProbe data:

MS SQL CREATE TABLE NetworkTraffic ( ID INT IDENTITY(1,1) PRIMARY KEY, SourceIP VARCHAR(15), DestinationIP VARCHAR(15), Packets INT, Bytes BIGINT, Timestamp DATETIME );

Declaring `ID` as an `IDENTITY` column lets the inserts below omit it and have SQL Server generate the key automatically.

Intermediate SQL Techniques

4. Data Manipulation

Inserting Data

To insert data into the `NetworkTraffic` table, you might collect information from various sources, such as network sensors or logs.

MS SQL INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES ('10.0.0.1', '192.168.1.1', 150, 2048, '2023-10-01T14:30:00');

Batch insert to minimize the impact on database performance:

MS SQL INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES ('10.0.0.2', '192.168.1.2', 50, 1024, '2023-10-01T15:00:00'), ('10.0.0.3', '192.168.1.3', 100, 1536, '2023-10-01T15:05:00'), ('10.0.0.4', '192.168.1.4', 200, 4096, '2023-10-01T15:10:00');

Updating Data

You may need to update records as new data becomes available or corrections are necessary. For instance, updating the byte count for a particular traffic record:

MS SQL UPDATE NetworkTraffic SET Bytes = 3072 WHERE ID = 1;

Update multiple fields at once:

MS SQL UPDATE NetworkTraffic SET Packets = 180, Bytes = 3072 WHERE SourceIP = '10.0.0.1' AND Timestamp = '2023-10-01T14:30:00';

Deleting Data

Removing data is straightforward but should be handled with caution to avoid accidental data loss.

MS SQL DELETE FROM NetworkTraffic WHERE Timestamp < '2023-01-01';

Conditional delete based on network traffic analysis:

MS SQL DELETE FROM NetworkTraffic WHERE Bytes < 500 AND Timestamp BETWEEN '2023-01-01' AND '2023-06-01';

Querying Data

Simple queries: Retrieve basic information from your data set.

MS SQL SELECT * FROM NetworkTraffic WHERE SourceIP = '10.0.0.1';

Select specific columns:

MS SQL SELECT SourceIP, DestinationIP, Bytes FROM NetworkTraffic WHERE Bytes > 2000;

Aggregate Functions

Useful for summarizing or analyzing large data sets.
MS SQL SELECT AVG(Bytes), MAX(Bytes), MIN(Bytes) FROM NetworkTraffic WHERE Timestamp > '2023-01-01'; Grouping data for more detailed analysis: MS SQL SELECT SourceIP, AVG(Bytes) AS AvgBytes FROM NetworkTraffic GROUP BY SourceIP HAVING AVG(Bytes) > 1500; Join Operations In scenarios where you have multiple tables, joins are essential. Assume another table `IPDetails` that stores additional information about each IP. MS SQL SELECT n.SourceIP, n.DestinationIP, n.Bytes, i.Location FROM NetworkTraffic n JOIN IPDetails i ON n.SourceIP = i.IPAddress WHERE n.Bytes > 1000; Complex Queries Combining multiple SQL operations to extract in-depth insights. MS SQL SELECT SourceIP, SUM(Bytes) AS TotalBytes FROM NetworkTraffic WHERE Timestamp BETWEEN '2023-01-01' AND '2023-02-01' GROUP BY SourceIP ORDER BY TotalBytes DESC; Advanced SQL Server Features 5. Indexing for Performance Optimizing SQL Server performance through indexing and leveraging stored procedures for automation is critical for managing large databases efficiently. Here’s an in-depth look at both topics, with practical examples, particularly focusing on enhancing operations within a network traffic database like the one collected from nProbe. Why Indexing Matters Indexing is a strategy to speed up the retrieval of records from a database by reducing the number of disk accesses required when a query is processed. It is especially vital in databases with large volumes of data, where search operations can become increasingly slow. Types of Indexes Clustered indexes: Change the way records are stored in the database as they sort and store the data rows in the table based on their key values. Tables can have only one clustered index. Non-clustered indexes: Do not alter the physical order of the data, but create a logical ordering of the data rows and use pointers to physical rows; each table can have multiple non-clustered indexes. Example: Creating an Index on Network Traffic Data Suppose you frequently query the `NetworkTraffic` table to fetch records based on `SourceIP` and `Timestamp`. You can create a non-clustered index to speed up these queries: MS SQL CREATE NONCLUSTERED INDEX idx_networktraffic_sourceip ON NetworkTraffic (SourceIP, Timestamp); This index would particularly improve performance for queries that look up records by `SourceIP` and filter on `Timestamp`, as the index helps locate data quickly without scanning the entire table. Below are additional instructions on utilizing indexing effectively. 6. Stored Procedures and Automation Benefits of Using Stored Procedures Stored procedures help in encapsulating SQL code for reuse and automating routine operations. They enhance security, reduce network traffic, and improve performance by minimizing the amount of information sent to the server. Example: Creating a Stored Procedure Imagine you often need to insert new records into the `NetworkTraffic` table. 
A stored procedure that encapsulates the insert operation can simplify the addition of new records: MS SQL CREATE PROCEDURE AddNetworkTraffic @SourceIP VARCHAR(15), @DestinationIP VARCHAR(15), @Packets INT, @Bytes BIGINT, @Timestamp DATETIME AS BEGIN INSERT INTO NetworkTraffic (SourceIP, DestinationIP, Packets, Bytes, Timestamp) VALUES (@SourceIP, @DestinationIP, @Packets, @Bytes, @Timestamp); END; Using the Stored Procedure To insert a new record, instead of writing a full insert query, you simply execute the stored procedure: MS SQL EXEC AddNetworkTraffic @SourceIP = '192.168.1.1', @DestinationIP = '10.0.0.1', @Packets = 100, @Bytes = 2048, @Timestamp = '2024-04-12T14:30:00'; Automation Example: Scheduled Tasks SQL Server Agent can be used to schedule the execution of stored procedures. For instance, you might want to run a procedure that cleans up old records every night: MS SQL CREATE PROCEDURE CleanupOldRecords AS BEGIN DELETE FROM NetworkTraffic WHERE Timestamp < DATEADD(month, -1, GETDATE()); END; You can schedule this procedure to run automatically at midnight every day using SQL Server Agent, ensuring that your database does not retain outdated records beyond a certain period. By implementing proper indexing strategies and utilizing stored procedures, you can significantly enhance the performance and maintainability of your SQL Server databases. These practices are particularly beneficial in environments where data volumes are large and efficiency is paramount, such as in managing network traffic data for IFC systems. 7. Performance Tuning and Optimization Performance tuning and optimization in SQL Server are critical aspects that involve a systematic review of database and system settings to improve the efficiency of your operations. Proper tuning not only enhances the speed and responsiveness of your database but also helps in managing resources more effectively, leading to cost savings and improved user satisfaction. Key Areas for Performance Tuning and Optimization 1. Query Optimization Optimize queries: The first step in performance tuning is to ensure that the queries are as efficient as possible. This includes selecting the appropriate columns, avoiding unnecessary calculations, and using joins effectively. Query profiling: SQL Server provides tools like SQL Server Profiler and Query Store that help identify slow-running queries and bottlenecks in your SQL statements. Example: Here’s how you can use the Query Store to find performance issues: MS SQL SELECT TOP 10 qt.query_sql_text, rs.avg_duration FROM sys.query_store_query_text AS qt JOIN sys.query_store_plan AS qp ON qt.query_text_id = qp.query_text_id JOIN sys.query_store_runtime_stats AS rs ON qp.plan_id = rs.plan_id ORDER BY rs.avg_duration DESC; 2. Index Management Review and adjust indexes: Regularly reviewing the usage and effectiveness of indexes is vital. Unused indexes should be dropped, and missing indexes should be added where significant performance gains can be made. Index maintenance: Rebuilding and reorganizing indexes can help in maintaining performance, especially in databases with heavy write operations. Example: Rebuild an index using T-SQL: MS SQL ALTER INDEX ALL ON dbo.YourTable REBUILD WITH (FILLFACTOR = 90, SORT_IN_TEMPDB = ON, STATISTICS_NORECOMPUTE = OFF); 3. Database Configuration and Maintenance Database settings: Adjust database settings such as recovery model, file configuration, and buffer management to optimize performance. 
Routine maintenance: Implement regular maintenance plans that include updating statistics, checking database integrity, and cleaning up old data. Example: Set up a maintenance plan in SQL Server Management Studio (SSMS) using the Maintenance Plan Wizard. 4. Hardware and Resource Optimization Hardware upgrades: Sometimes, the best way to achieve performance gains is through hardware upgrades, such as increasing memory, adding faster disks, or upgrading CPUs. Resource allocation: Ensure that the SQL Server has enough memory and CPU resources allocated, particularly in environments where the server hosts multiple applications. Example: Configure maximum server memory: MS SQL EXEC sp_configure 'max server memory', 4096; RECONFIGURE; 5. Monitoring and Alerts System monitoring: Continuous monitoring of system performance metrics is crucial. Tools like System Monitor (PerfMon) and Dynamic Management Views (DMVs) in SQL Server provide real-time data about system health. Alerts setup: Configure alerts for critical conditions, such as low disk space, high CPU usage, or blocking issues, to ensure that timely actions are taken. Example: Set up an alert in SQL Server Agent: MS SQL USE msdb ; GO EXEC dbo.sp_add_alert @name = N'High CPU Alert', @message_id = 0, @severity = 0, @enabled = 1, @delay_between_responses = 0, @include_event_description_in = 1, @notification_message = N'SQL Server CPU usage is high.', @performance_condition = N'SQLServer:SQL Statistics|Batch Requests/sec|_Total|>|1000', @job_id = N'00000000-1111-2222-3333-444444444444'; GO Performance tuning and optimization is an ongoing process, requiring regular adjustments and monitoring. By systematically addressing these key areas, you can ensure that your SQL Server environment is running efficiently, effectively supporting your organizational needs. Conclusion Mastering SQL Server is a journey that evolves with practice and experience. Starting from basic operations to leveraging advanced features, SQL Server provides a powerful toolset for managing and analyzing data. As your skills progress, you can handle larger datasets like those from nProbe, extracting valuable insights and improving your network's performance and security. For those looking to dive deeper, Microsoft offers extensive documentation and a community rich with resources to explore more complex SQL Server capabilities. Useful References nProbe SQL Server SQL server performance tuning
Time Series Forecasting: Single Variate

In today's data-driven landscape, businesses across industries are continually seeking ways to gain a competitive edge. One of the most powerful tools at their disposal is time series forecasting, a technique that allows organizations to predict future trends based on historical data. From finance to healthcare, time series forecasting is transforming how companies strategize and make decisions.

What Is Time Series Forecasting?

Time series forecasting involves analyzing data points collected or recorded at specific time intervals. Unlike static data, time series data is chronological, often exhibiting patterns like trends and seasonality. Forecasting methods leverage these patterns to predict future values, providing insights that are invaluable for planning and strategy.

Some Use Cases of Time Series Forecasting

Sales forecasting: Retail businesses use time series forecasting to predict future sales. By analyzing past sales data, they can anticipate demand, optimize inventory, and plan marketing campaigns.
Stock market analysis: Financial analysts employ time series forecasting to predict stock prices and market trends. This helps investors make informed decisions about buying or selling assets.
Weather prediction: Meteorologists use time series forecasting to predict weather patterns. This data is crucial for agriculture, disaster preparedness, and daily planning.
Healthcare resource planning: Hospitals and clinics use forecasting to anticipate patient influx. This helps in managing resources such as staff, beds, and medical supplies.
Energy consumption forecasting: Utility companies leverage time series forecasting to predict energy demand. This enables efficient management of power grids and resource allocation.

Forecasting Techniques

Forecasting techniques include:
Statistical analysis
Machine learning algorithms like ARIMA
Neural networks (RNN): LSTM

LSTM

LSTM networks are specialized neural networks that handle sequences of data. Unlike regular feedforward neural networks, LSTMs have loops that allow information to persist, making them ideal for tasks where context over time is crucial. Each LSTM cell consists of three parts: an input gate, a forget gate, and an output gate, which regulate the flow of information, allowing the network to selectively retain or forget information over time. More details can be found in the video "Long Short-Term Memory (LSTM), Clearly Explained." In this tutorial, we are going to focus on single-variate LSTM analysis. Soon I will be publishing an implementation approach for multivariate analysis.

Main Code

Importing Libraries

Python import tensorflow as tf import os import pandas as pd import numpy as np import matplotlib.pyplot as plt from sklearn.preprocessing import MinMaxScaler from keras.preprocessing.sequence import TimeseriesGenerator as TSG from keras.models import Sequential from keras.layers import Dense from keras.layers import LSTM from tensorflow.keras.models import load_model from tensorflow.keras.callbacks import ModelCheckpoint import tensorflow.keras.optimizers as optimizers import seaborn as sns import statsmodels.api as sm

Load the data from the CSV file; this is beverage sales data. Then rename the column Unnamed: 0 to date, and convert the entries in the date column from strings (or any other format they might be in) to datetime objects. The range function generates a sequence of numbers starting from 1 and ending at the total number of entries in the DataFrame (inclusive).
This sequence is then assigned to a new column called sequence. Python import pandas as pd # Load data from a CSV file into a DataFrame dfc = pd.read_csv('sales_beverages.csv') # Rename the column from 'Unnamed: 0' to 'date' dfc = dfc.rename(columns={"Unnamed: 0": "date"}) # Convert the 'date' column from a string type to a datetime type to facilitate date manipulation dfc["date"] = pd.to_datetime(dfc["date"]) # Add a new column called 'sequence' which is a sequence of integers from 1 to the number of rows in the DataFrame # This sequence helps in identifying the row number or providing a simple ordinal index dfc["sequence"] = range(1, len(dfc) + 1) # Display the modified DataFrame dfc DATE Sales_Beverages Sequence 0 2016-01-02 250510.0 1 1 2016-01-03 299177.0 2 2 2016-01-04 217525.0 3 3 2016-01-05 187069.0 4 4 2016-01-06 170360.0 5 ... ... ... ... 586 2017-08-11 189111.0 587 587 2017-08-12 182318.0 588 588 2017-08-13 202354.0 589 589 2017-08-14 174832.0 590 590 2017-08-15 170773.0 591 591 rows × 3 columns Exploring the Data Python print('Number of Samples = {}'.format(dfc.shape[0])) print('Training X Shape = {}'.format(dfc.shape)) print('Index of data set:\n', dfc.columns) print(dfc.info()) print('\nMissing values of data set:\n', dfc.isnull().sum()) print('\nNull values of data set:\n', dfc.isna().sum()) # Generate a complete range of dates from the min to max all_dates = pd.date_range(start=dfc['date'].min(), end=dfc['date'].max(), freq='D') # Find missing dates by checking which dates in 'all_dates' are not in 'df['date']' missing_dates = all_dates.difference(dfc['date']) # Display the missing dates print("Missing dates are ", missing_dates) Number of Samples = 591 Training X Shape = (591, 3) Index of data set: Index(['date', 'sales_BEVERAGES', 'sequence'], dtype='object') <class 'pandas.core.frame.DataFrame'> RangeIndex: 591 entries, 0 to 590 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 591 non-null datetime64[ns] 1 sales_BEVERAGES 591 non-null float64 2 sequence 591 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(1) memory usage: 14.0 KB None Missing values of data set: date 0 sales_BEVERAGES 0 sequence 0 dtype: int64 Null values of data set: date 0 sales_BEVERAGES 0 sequence 0 dtype: int64 Missing dates are DatetimeIndex(['2016-12-25'], dtype='datetime64[ns]', freq=None) Break the date columns into individual units like, year, month, date, and day of the week to understand if there is any pattern in the sales data. year: The year part of the date month: The month part of the date day: The day of the month day_of_week: The name of the day of the week (e.g., Monday, Tuesday) day_of_week_num: The numerical representation of the day of the week (0 for Monday through 6 for Sunday) Python # Extract year, month, day, and day of the week dfc['year'] = dfc['date'].dt.year dfc['month'] = dfc['date'].dt.month dfc['day'] = dfc['date'].dt.day dfc['day_of_week'] = dfc['date'].dt.day_name() dfc['day_of_week_num'] = dfc['date'].dt.dayofweek A correlation matrix is computed for selected columns (year, month, day, day_of_week_num, and sales_BEVERAGES). This matrix measures the linear relationships between these variables, which can help in understanding how different date components influence beverage sales. 
Python

# Calculate correlation matrix
correlation_matrix = dfc[['year', 'month', 'day', 'day_of_week_num', 'sales_BEVERAGES']].corr()
# Print the correlation matrix
#print(correlation_matrix)
# Set up the matplotlib figure
plt.figure(figsize=(10, 8))
# Draw the heatmap
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', cbar=True)
# Add a title and format it
plt.title('Heatmap of Correlation Between Date Components and Sales')
# Show the plot
plt.show()

The heatmap above shows a strong correlation between sales and the day of the week, and between sales and the year. Let's draw the corresponding graphs to verify the variations.

Data Visualization

Day of the Week vs. Sales

Python

plt.figure(figsize=(10, 6))
sns.barplot(x='day_of_week', y='sales_BEVERAGES', data=dfc)
plt.title('Day of Week vs. Sales')
plt.xlabel('Day of Week (0=Monday, 6=Sunday)')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

Year vs. Sales

Python

plt.figure(figsize=(10, 6))
sns.lineplot(x='year', y='sales_BEVERAGES', data=dfc, marker='o')
plt.title('Year vs. Sales')
plt.xlabel('Year')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

There is a clear indication that sales are high on weekends and lower on Thursdays. Yearly sales are also increasing every year, and the trend is roughly linear, which means there are not many variations.

Month-Year vs. Sales

Let's quickly verify the variation with the year-month combination as well.

Python

dfc['month_year'] = dfc['date'].dt.to_period('M')
plt.figure(figsize=(16, 8))
sns.barplot(x='month_year', y='sales_BEVERAGES', data=dfc)
plt.title('Month vs. Sales')
plt.xlabel('Month-Year')
plt.ylabel('Sales')
plt.grid(True)
plt.show()

Due to limited data, it is not very clear, but it seems that sales are higher in December and January.

Average Sales Calculation Per Year

Python

# Evaluate the average sales per year on a monthly basis.
a = dfc[dfc['year'].isin([2016,2017])].groupby(["year", "month"]).sales_BEVERAGES.mean().reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(data=a, x='month', y='sales_BEVERAGES', hue='year', marker='o')
# Enhance the plot with titles and labels
plt.title('Average Sales for 2016 and 2017')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.legend(title='Year')
plt.grid(True)
# Show the plot
plt.show()

ACF vs. PACF (Not Required for LSTM as Such)

This step is generally used with the ARIMA model, but it will still give you some good visibility into the window size used later.

Python

fig, ax = plt.subplots(1,2,figsize=(15,5))
sm.graphics.tsa.plot_acf(dfc.sales_BEVERAGES, lags=365, ax=ax[0], title = "AUTOCORRELATION\n")
sm.graphics.tsa.plot_pacf(dfc.sales_BEVERAGES, lags=180, ax=ax[1], title = "PARTIAL AUTOCORRELATION\n")

Next, extract a subset of the data for the trend analysis. We only keep date and sales_BEVERAGES, as we are going to perform a single-variate analysis in this exercise:

Python

df1=dfc[["date",'sales_BEVERAGES']]
df1.head()

   date        sales_BEVERAGES
0  2016-01-02  250510.0
1  2016-01-03  299177.0
2  2016-01-04  217525.0
3  2016-01-05  187069.0
4  2016-01-06  170360.0

The next line plots the sales_BEVERAGES column from df1, starting from the second element (index 1) to the end. The exclusion of the first data point ([1:]) is used to avoid a specific outlier. We then filter df1 to only include rows where sales_BEVERAGES is greater than 20,000; this step removes further outliers.
Python

plt.plot(df1['sales_BEVERAGES'][1:])
df1=df1[df1['sales_BEVERAGES']>20000]
df2=df1['sales_BEVERAGES'][1:]
df2.shape

(589,)

MinMaxScaler

MinMaxScaler from the sklearn.preprocessing library is used to scale the data in the df2 series. This is a common preprocessing step in data analysis and machine learning tasks, especially when working with neural networks, as it normalizes the data within a specified range, typically [0, 1]:

X_scaled = (X - X_min) / (X_max - X_min)

It is used here to improve the convergence process. Many machine learning algorithms that use gradient descent as an optimization technique (e.g., linear regression, logistic regression, neural networks) converge faster when features are scaled. If one feature's range is orders of magnitude larger than others, it can dominate the objective function and make the model unable to learn effectively from other features.

Python

scaler=MinMaxScaler()
# Fit the scaler on the series values (reshaped to a single column)
scaler.fit(df2.values.reshape(-1, 1))
df2=scaler.transform(df2.values.reshape(-1, 1))

Python

plt.plot(df2)

I have not used the function below, but I have still included it with a short explanation. I used the time-series generator instead, though both of them perform the same task; you can use either. The function is designed to convert a Pandas DataFrame into input-output pairs (X, y) for use in machine learning models, particularly those involving time series data, such as LSTMs.

window_size: An integer indicating the number of time steps in each input sequence, defaulted to 5
df_as_np converts the Pandas DataFrame to a NumPy array to facilitate numerical operations and slicing.
Two lists are created: X for storing input sequences and y for storing the corresponding labels (output).
It iterates over the NumPy array, starting from the first index up to the length of the array minus the window_size. This ensures that each input sequence and its corresponding output value can be captured without going out of bounds.
For each iteration, it extracts a sequence of length window_size from the array and appends it to X. This sequence serves as one input sample.
The output value (label) corresponding to each input sequence is the next value immediately following the sequence in the DataFrame. This value is appended to y.

Example:
X=[1,2,3,4,5], y=6
X=[2,3,4,5,6], y=7
X=[3,4,5,6,7], y=8
X=[4,5,6,7,8], y=9
and so on...

Python

def df_to_X_y(df, window_size=5):
  df_as_np = df.to_numpy()
  X = []
  y = []
  for i in range(len(df_as_np)-window_size):
    row = [[a] for a in df_as_np[i:i+window_size]]
    X.append(row)
    label = df_as_np[i+window_size]
    y.append(label)
  return np.array(X), np.array(y)

Python

WINDOW_SIZE = 5
X1, y1 = df_to_X_y(df2, WINDOW_SIZE)
X1.shape, y1.shape
print(y1)

[143636. 152225. 227854. 263121. 157869. 136315. 132266. 120609. 141955. 220308. 251345. 158492. 136240. 143371. 115821. 135214. 204449. 231483. 141976. 128256. 129324. 113870. 137022. 209541. 245481. 182638. 154284. 149974. 134005. 167256. 207438. 152830. 133559. 157846. 154782. 132974. 144742. 190061. 219933. 166667. 150444. 142628. 124212. 146081. 203285. 234842. 153189. 134845. 137272. 120695. 137555. 208705. 229672. 158195. 179419. 170183. 135577. 152201. 227024. 245308. 155266. 132163. 137198. 119723. 141062. 201038. 223273. 144170. 135828. 147195. 121907. 143712. 202664. 216151. 148126. 130755. 148247. 149854. 149515. 182196. 195375. 143196. 130183. 129972. 129134. 178237. 247315. 280881. 168081. 146023. 145034. 122792. 149302. 209669. 236767. 146607. 134193.
138348. 115020. 136320. 186935. 308788. 303298. 301533. 249845. 213186. 191154. 233084. 238503. 148627. 135431. 136526. 114193. 146007. 232805. 282785. 181088. 161856. 154805. 135208. 155813. 233769. 193033. 167064. 142775. 146886. 125988. 138176. 206787. 247562. 159437. 135697. 133039. 120632. 140732. 198856. 235966. 146066. 118786. 119655. 118074. 173865. 169401. 210425. 154183. 189942. 144778. 136640. 136752. 200698. 237485. 143265. 122148. 123561. 103888. 120510. 177120. 209344. 145511. 122071. 130428. 117386. 138623. 201641. 188682. 156605. 144562. 130519. 110900. 127196. 186097. 211047. 143453. 120127. 120697. 111342. 163624. 221451. 240162. 171926. 141837. 141899. 117203. 137729. 186086. 205290. 148417. 127538. 120720. 108521. 139563. 191821. 206438. 148214. 123942. 128434. 115017. 129281. 178923. 188675. 148783. 124377. 132795. 107270. 133460. 191957. 216431. 180546. 152668. 145874. 128160. 148293. 193330. 206605. 157126. 137263. 138205. 135983. 164500. 166578. 180725. 158646. 147799. 147254. 127986. 150082. 187625. 211220. 155457. 142435. 141334. 124207. 134789. 176165. 197233. 147156. 133625. 145155. 147069. 181079. 238510. 261398. 183848. 164550. 154897. 123746. 138299. 206418. 235684. 145080. 122882. 121120. 116264. 143598. 200090. 235321. 141236. 132262. 129414. 110130. 136138. 192610. 221098. 143488. 122181. 123595. 112182. 142867. 251375. 279121. 172823. 146150. 146410. 120057. 143269. 202566. 247109. 153350. 125318. 129236. 111697. 138234. 197333. 258559. 151406. 129897. 127212. 124603. 144526. 192343. 241561. 142098. 124323. 128716. 120153. 136370. 194747. 232250. 148589. 182070. 215033. 180293. 193535. 208685. 270422. 187162. 166081. 164618. 129184. 150597. 222661. 291398. 165265. 160177. 181322. 138887. 167311. 220970. 278158. 172392. 151843. 157465. 133102. 170648. 223057. 263835. 177635. 140124. 164748. 178953. 185360. 255126. 297968. 182323. 207703. 178510. 140546. 163758. 209125. 260947. 168443. 148518. 159319. 146315. 169151. 226210. 270298. 196844. 194254. 198153. 198308. 226894. 236331. 227027. 177554. 192477. 186177. 240693. 243518. 4008. 335235. 243422. 211239. 175975. 189393. 261820. 297197. 186203. 171274. 164531. 145461. 174206. 252034. 301353. 199820. 184129. 176227. 144535. 162192. 264633. 299512. 191891. 167718. 160219. 125294. 156006. 226837. 257357. 155191. 165171. 192241. 155016. 173306. 256450. 265030. 171537. 156490. 161764. 132978. 164050. 220696. 255490. 169350. 129329. 147599. 137081. 156814. 246049. 213733. 167601. 157364. 148629. 149845. 182391. 230937. 168924. 165020. 212594. 204522. 180400. 186437. 257990. 276118. 169456. 157163. 150271. 147502. 177393. 245596. 288397. 178705. 163684. 173812. 164418. 188890. 259101. 297490. 192579. 172289. 167424. 153886. 182043. 257097. 284616. 188293. 164975. 177997. 136349. 188660. 336063. 264738. 188774. 184424. 181898. 153189. 171158. 228604. 262298. 170621. 163715. 171716. 177420. 179465. 216599. 233163. 175805. 158029. 149701. 144429. 169675. 236707. 285611. 175184. 161949. 164587. 143934. 180469. 250534. 249008. 303807. 200529. 188754. 149629. 161279. 233814. 287104. 166843. 145619. 147196. 135028. 154026. 244193. 206986. 179114. 169098. 165675. 133381. 161718. 227900. 280849. 169143. 151437. 153706. 136779. 212870. 212127. 254132. 171962. 158403. 174304. 166771. 204402. 278488. 339352. 214773. 184706. 181931. 152212. 178063. 242234. 311184. 176821. 158624. 158633. 142765. 181072. 250214. 245520. 179095. 173553. 154251. 125467. 160086. 218486. 263497. 166889. 140339. 143776. 136268. 170346. 271027. 
297619. 199766. 173857. 170074. 150965. 178964. 232222. 262375. 179826. 162466. 158262. 149968. 181719. 246513. 283097. 193199. 170182. 163361. 163747. 183117. 229380. 245466. 188077. 160403. 156176. 141686. 191922. 249085. 274030. 195504. 215546. 204566. 156806. 187818. 225481. 250784. 179419. 160636. 153010. 156449. 189111. 182318. 202354. 174832. 170773.] TimeSeriesGenerator The TimeseriesGenerator utility from the Keras library in TensorFlow is a powerful tool for generating batches of temporal data. This utility is particularly useful when working with sequence prediction problems involving time series data. This aims to simplify the creation of a TimeseriesGenerator instance for a given dataset. Params data: The dataset used to generate the input sequences targets: The dataset containing the targets (or labels) for each input sequence; In many time series forecasting tasks, the targets are the same as the data because you are trying to predict the next value in the same series. length: The number of time steps in each input sequence (specified by n_input). batch_size: The number of sequences to return per batch (set to 1 here, which means the generator yields one input-target pair per batch) Advantages Using a TimeseriesGenerator offers several advantages: Memory efficiency: It generates data batches on the fly and hence, is more memory-efficient than pre-generating and storing all possible sequences. Ease of use: It integrates seamlessly with Keras models, especially when using model training routines like fit_generator. Flexibility: It can handle varying lengths of input sequences and can easily adapt to different forecasting horizons. Python def ts_generator(dataset,n_input): generator=TSG(dataset,dataset,length=n_input,batch_size=1) return generator Python #Number of steps to use for predicting the next step WINDOW_SIZE = 30 #This defines the number of features, in our case it is one and it should be sames as the count of neurons in the Dense Layer n_features=1 generator=ts_generator(df2,WINDOW_SIZE) The code snippet provided iterates over a TimeseriesGenerator object, collecting and aggregating all batches into two large NumPy arrays: X_val for inputs and y_val for targets. Python X_val,y_val=generator[0] for i in range(len(generator)): X2, Y2 = generator[i] print("X:", X2) #print("Y:", type(Y)) X_val = np.vstack((X_val, X2)) y_val = np.vstack((y_val, Y2)) X_val=X_val[1:] y_val=y_val[1:] X_val=X_val.reshape(X_val.shape[0],WINDOW_SIZE,n_features) y_val=y_val.flatten() print(X_val.shape) print(y_val) Split this dataset into training, validation, and testing sets based on percentages of the dataset's total length. Python #l_percent is set to 85%, marking the cutoff for the training set. #h_percent is set to 90%, marking the end of the validation set and the beginning of the test set. l_percent=0.85 h_percent=0.90 #l_cnt is the index at 85% of df2, used to separate the training set from the validation set. #h_cnt is the index at 90% of df2, used to separate the validation set from the testing set. 
l_cnt=round(l_percent * len(df2))
h_cnt=round(h_percent * len(df2))

#Splits for dataset creation
val_sales,val_target=X_val[l_cnt:h_cnt],y_val[l_cnt:h_cnt]
train_sales,train_target=X_val[:l_cnt],y_val[:l_cnt]
test_sales,test_traget=X_val[h_cnt:],y_val[h_cnt:]

print(val_sales.shape,val_target.shape,train_sales.shape,train_target.shape,test_sales.shape,test_traget.shape)

(30, 30, 1) (30,) (502, 30, 1) (502,) (29, 30, 1) (29,)

Setting up a Deep Learning Model Using Keras (TensorFlow Backend) for a Time Series Forecasting Task

The code below sets up a deep learning model using Keras (TensorFlow backend) for a time series forecasting task, integrating callbacks for better training management and defining an LSTM-based neural network.

Callback Setup

EarlyStopping: Stops training when the monitored metric has stopped improving for a specified number of epochs (patience=20). This helps to avoid overfitting and saves computational resources.
ReduceLROnPlateau: Reduces the learning rate when a metric has stopped improving, which can lead to finer tuning of the model. It decreases the learning rate by a factor of 0.25 after the performance plateaus for 10 epochs, with the minimum learning rate set to 1e-9 (0.000000001).
ModelCheckpoint: Saves the model to a directory named model/ after every epoch, but only if it is the best so far (in terms of the loss on the validation set), so that the best-performing version can later be reloaded.

Layer Setup

Layer 1: The first LSTM layer has 128 units and returns sequences. This means it returns the full sequence to the next layer rather than just the output of the last timestep, a setup often used when stacking LSTM layers. It uses tanh activation and includes dropout and recurrent dropout of 0.2 to combat overfitting.
Layer 2: The second LSTM layer has 64 units and does not return sequences, indicating it is the final LSTM layer and only returns the output from the last timestep. Similar to the first LSTM, it uses tanh activation with the same dropout and recurrent dropout settings.
Layer 3: A dense layer with 64 units acts as a fully connected layer following the LSTM layers to interpret the features extracted from the sequences.
Layer 4: The final dense layer with a single unit is typical for regression tasks in time series, where you predict a single continuous value.

Normally, the ReLU function is used with LSTM layers, but I have seen better results with tanh in the preliminary analysis, so tanh is used here.
Python

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

callbacks = [
    EarlyStopping(patience=20, verbose=1),
    ReduceLROnPlateau(factor=0.25, patience=10, min_lr=0.000000001, verbose=1),
    # Save the best full model to 'model/' so it can later be reloaded with load_model('model/')
    ModelCheckpoint('model/', verbose=1, save_best_only=True)
]

model=Sequential()
model.add(LSTM(128, activation='tanh', dropout=0.2, recurrent_dropout=0.2, return_sequences=True, input_shape=(WINDOW_SIZE,n_features)))
model.add(LSTM(64, activation='tanh', dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(64))
model.add(Dense(n_features))
model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                Output Shape              Param #
=================================================================
lstm_16 (LSTM)              (None, 30, 128)           66560
lstm_17 (LSTM)              (None, 64)                49408
dense_15 (Dense)            (None, 64)                4160
dense_16 (Dense)            (None, 1)                 65
=================================================================
Total params: 120193 (469.50 KB)
Trainable params: 120193 (469.50 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Next, compile the Keras model with specific settings for the optimizer, loss function, and evaluation metrics.

Optimizer

optimizers.Adam(lr=.000001): This specifies the Adam optimizer with a learning rate (lr) of 0.000001. Adam is an adaptive learning rate optimization algorithm that has become the default optimizer for many types of neural networks because it combines the best properties of the AdaGrad and RMSProp algorithms to handle sparse gradients on noisy problems.
Learning rate: Setting the learning rate to a very small value, like 0.000001, makes the model take smaller steps when updating weights, which can lead to very slow convergence. Such a low value is typically used when fine-tuning a model or when larger steps might cause the training process to overshoot the minima.

Loss Function

loss='mse': This sets the loss function to Mean Squared Error (MSE), which is commonly used for regression tasks. MSE computes the average squared difference between the estimated values and the actual values, making it sensitive to outliers because it squares the errors.

Python

# Compile the model
#cp1 = ModelCheckpoint('model/', save_best_only=True)
model.compile(optimizer=optimizers.Adam(lr=.000001), loss= 'mse', metrics = ['mean_squared_error'])

The call to model.fit below specifies how the training should be conducted, including the datasets to be used, the number of training epochs, and any callbacks that should be applied during the training process.
epochs=200: The number of times the model will work through the entire training dataset Python model.fit(train_sales, train_target, validation_data=(val_sales, val_target) ,epochs=200, callbacks=callbacks) Epoch 1/200 16/16 [==============================] - 1s 72ms/step - loss: 0.0055 - mean_squared_error: 0.0055 - val_loss: 0.0110 - val_mean_squared_error: 0.0110 Epoch 2/200 16/16 [==============================] - 2s 98ms/step - loss: 0.0063 - mean_squared_error: 0.0063 - val_loss: 0.0067 - val_mean_squared_error: 0.0067 Epoch 3/200 16/16 [==============================] - 1s 71ms/step - loss: 0.0062 - mean_squared_error: 0.0062 - val_loss: 0.0119 - val_mean_squared_error: 0.0119 Epoch 4/200 16/16 [==============================] - 1s 71ms/step - loss: 0.0058 - mean_squared_error: 0.0058 - val_loss: 0.0097 - val_mean_squared_error: 0.0097 training_loss_per_epoch: This retrieves the training loss for each epoch from the model's history object. The training loss is a measure of how well the model fits the training data, decreasing over time as the model learns. validation_loss_per_epoch: Similarly, this retrieves the validation loss for each epoch. Validation loss measures how well the model performs on a new, unseen dataset (validation dataset), which helps to monitor for overfitting. Overfitting: If your training loss continues to decrease, but your validation loss begins to increase, this may indicate that the model is overfitting to the training data. Underfitting: If both training and validation losses remain high, this might suggest that the model is underfitting and not learning adequately from the training data. Early stopping: By examining these curves, you can also make decisions about using early stopping to halt training at the optimal point before the model overfits. Python training_loss_per_epoch=model.history.history['loss'] validation_loss_per_epoch=model.history.history['val_loss'] plt.plot(range(len(training_loss_per_epoch)),training_loss_per_epoch) plt.plot(range(len(validation_loss_per_epoch)),validation_loss_per_epoch) load_model is a function from Keras that allows you to load a complete model saved in TensorFlow's Keras format. This includes not only the model's architecture but also its learned weights and its training configuration (loss, optimizer). Python from tensorflow.keras.models import load_model model1=load_model('model/') Fetching out the dates from the original DataFrame to set it in the predictions for visibility. Since we are using the window size of 30, the first output/target/prediction will be generated after window_size. Accordingly, manipulate the dates for all three datasets. The below code is designed to extract specific date ranges from a DataFrame to align them with corresponding training, validation, and test datasets. This is particularly useful when you want to track or analyze results over time or relate them to specific events or changes reflected by dates. 
Python date_df=df1[df1['sales_BEVERAGES']>20000] date_df.count() ####Fetching the dates from the df1 for the train dataset train_date=date_df['date'][WINDOW_SIZE - 1:l_cnt + WINDOW_SIZE - 1] print(train_date.count()) ####Fetching the dates from the df1 for the val dataset val_date=date_df['date'][l_cnt + WINDOW_SIZE - 1:h_cnt + WINDOW_SIZE - 1:] print(val_date.count()) u_date=h_cnt + 1 + WINDOW_SIZE -1 test_date=df1['date'][h_cnt + WINDOW_SIZE - 1: ] test_date.count() 502 30 29 Training Data Actual Value vs Predicted Value Python #This function is used to generate predictions from your pre-trained model on the train_sales dataset. train_predictions = model1.predict(train_sales).flatten() # Applies the inverse transformation to the scaled predictions to bring them back to their original scale. train_pred=scaler.inverse_transform(train_predictions.reshape(-1, 1)) t=scaler.inverse_transform(train_target.reshape(-1, 1)) #print(train_pred.shape) #Creating a dataframe with Actual and predictions train_results = pd.DataFrame(data={'Train Predictions':train_pred.flatten(), 'Actuals':t.flatten(),'dt':train_date }) train_results.tail(20) 16/16 [==============================] - 0s 14ms/step (502, 1) Train Predictions Actuals dt 512 195323.593750 171962.000000 2017-05-29 513 164753.343750 158403.000000 2017-05-30 514 155985.328125 174304.000000 2017-05-31 515 153953.828125 166771.000000 2017-06-01 516 184015.109375 204402.000000 2017-06-02 517 246616.375000 278488.000000 2017-06-03 518 251735.953125 339352.000000 2017-06-04 519 187089.109375 214773.000000 2017-06-05 520 169009.390625 184706.000000 2017-06-06 521 160138.390625 181931.000000 2017-06-07 522 158093.562500 152212.000000 2017-06-08 523 186708.203125 178063.015625 2017-06-09 524 254521.234375 242234.000000 2017-06-10 525 263513.468750 311184.000000 2017-06-11 526 191338.093750 176820.984375 2017-06-12 527 168676.562500 158624.000000 2017-06-13 528 158633.203125 158633.000000 2017-06-14 529 153251.765625 142765.000000 2017-06-15 530 180730.171875 181071.984375 2017-06-16 531 251409.359375 250214.000000 2017-06-17 Below is a graphical representation of actual and predicted values of the training data for the first 100 pointers: Python plt.plot(train_results['dt'][:100],train_results['Train Predictions'][:100],label='pred') plt.plot(train_results['dt'][:100],train_results['Actuals'][:100],label='Actual') plt.legend() plt.xticks(rotation=45) plt.plot(figsize=(18,10)) plt.show() Validation Data Actual Value vs Predicted Value Python ##This function is used to generate predictions from your pre-trained model on the validation dataset. val_predictions = model1.predict(val_sales).flatten() # Applies the inverse transformation to the scaled predictions to bring them back to their original scale. 
val_pred=scaler.inverse_transform(val_predictions.reshape(-1, 1)) v=scaler.inverse_transform(val_target.reshape(-1, 1)) print(val_pred.shape) #Creating a dataframe with Actual and predictions val_results = pd.DataFrame(data={'Val Predictions':val_pred.flatten(), 'Actuals':v.flatten(),'dt':val_date }) val_results.head() 1/1 [==============================] - 0s 62ms/step (30, 1) Val Predictions Actuals dt 532 265612.906250 245519.984375 2017-06-18 533 186157.468750 179094.984375 2017-06-19 534 167559.578125 173553.000000 2017-06-20 535 158167.000000 154251.000000 2017-06-21 536 155162.000000 125467.000000 2017-06-22 Below is a graphical representation of actual and predicted values of the validation data: Python plt.plot(val_results['dt'],val_results['Val Predictions'],label='pred') plt.plot(val_results['dt'],val_results['Actuals'],label='Actual') plt.legend() plt.xticks(rotation=45) plt.plot(figsize=(18,10)) plt.show() Test Data Actual Value vs Predicted Value Python #This function is used to generate predictions from your pre-trained model on the test dataset. test_predictions = model1.predict(test_sales).flatten() # Applies the inverse transformation to the scaled predictions to bring them back to their original scale. test_pred=scaler.inverse_transform(test_predictions.reshape(-1, 1)) te=scaler.inverse_transform(test_traget.reshape(-1, 1)) print(test_pred.shape) #Creating a dataframe with Actual and predictions test_results = pd.DataFrame(data={'Test Predictions':test_pred.flatten(), 'Actuals':te.flatten(),'dt':test_date }) test_results.head() 1/1 [==============================] - 0s 35ms/step (29, 1) Test Predictions Actuals dt 562 166612.140625 170182.000000 2017-07-18 563 158095.812500 163361.000000 2017-07-19 564 153619.515625 163747.000000 2017-07-20 565 181217.421875 183117.015625 2017-07-21 566 246784.828125 229380.000000 2017-07-22 Python plt.plot(test_results['dt'],test_results['Test Predictions'],label='pred') plt.plot(test_results['dt'],test_results['Actuals'],label='Actual') plt.legend() plt.xticks(rotation=45) plt.plot(figsize=(18,10)) plt.show()
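The walkthrough above evaluates the forecasts visually. As an optional next step (not part of the original notebook), forecast accuracy can also be quantified with standard error metrics. The sketch below assumes the train_results, val_results, and test_results DataFrames created earlier, with their 'Actuals' and prediction columns.

Python

from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

def report_errors(name, actuals, predictions):
    # RMSE penalizes large misses; MAE is easier to read in sales units
    rmse = np.sqrt(mean_squared_error(actuals, predictions))
    mae = mean_absolute_error(actuals, predictions)
    # Mean absolute percentage error, expressed as a percentage
    mape = np.mean(np.abs((actuals - predictions) / actuals)) * 100
    print(f"{name}: RMSE={rmse:,.0f}  MAE={mae:,.0f}  MAPE={mape:.2f}%")

report_errors("Train", train_results['Actuals'], train_results['Train Predictions'])
report_errors("Validation", val_results['Actuals'], val_results['Val Predictions'])
report_errors("Test", test_results['Actuals'], test_results['Test Predictions'])

Comparing the three numbers side by side is a quick way to spot overfitting: if the training error is far lower than the validation and test errors, the model is memorizing the training window rather than generalizing.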
In an era where instant access to data is not just a luxury but a necessity, distributed caching has emerged as a pivotal technology in optimizing application performance. With the exponential growth of data and the demand for real-time processing, traditional methods of data storage and retrieval are proving inadequate. This is where distributed caching comes into play, offering a scalable, efficient, and faster way of handling data across various networked resources. Understanding Distributed Caching What Is Distributed Caching? Distributed caching refers to a method where information is stored across multiple servers, typically spread across various geographical locations. This approach ensures that data is closer to the user, reducing access time significantly compared to centralized databases. The primary goal of distributed caching is to enhance speed and reduce the load on primary data stores, thereby improving application performance and user experience. Key Components Cache store: At its core, the distributed cache relies on the cache store, where data is kept in-memory across multiple nodes. This arrangement ensures swift data retrieval and resilience to node failures. Cache engine: This engine orchestrates the operations of storing and retrieving data. It manages data partitioning for balanced distribution across nodes and load balancing to maintain performance during varying traffic conditions. Cache invalidation mechanism: A critical aspect that keeps the cache data consistent with the source database. Techniques such as time-to-live (TTL), write-through, and write-behind caching are used to ensure timely updates and data accuracy. Replication and failover processes: These processes provide high availability. They enable the cache system to maintain continuous operation, even in the event of node failures or network issues, by replicating data and providing backup nodes. Security and access control: Integral to protecting the cached data, these mechanisms safeguard against unauthorized access and ensure the integrity and confidentiality of data within the cache. Why Distributed Caching? Distributed caching is a game-changer in the realm of modern applications, offering distinct advantages that ensure efficient, scalable, and reliable software solutions. Speed and performance: Think of distributed caching as having express checkout lanes in a grocery store. Just as these lanes speed up the shopping experience, distributed caching accelerates data retrieval by storing frequently accessed data in memory. This results in noticeably faster and more responsive applications, especially important for dynamic platforms like e-commerce sites, real-time analytics tools, and interactive online games. Scaling with ease: As your application grows and attracts more users, it's like a store becoming more popular. You need more checkout lanes (or in this case, cache nodes) to handle the increased traffic. Distributed caching makes adding these extra lanes simple, maintaining smooth performance no matter how busy things get. Always up, always available: Imagine if one express lane closes unexpectedly – in a well-designed store, this isn’t a big deal because there are several others open. Similarly, distributed caching replicates data across various nodes. So, if one node goes down, the others take over without any disruption, ensuring your application remains up and running at all times. Saving on costs: Finally, using distributed caching is like smartly managing your store’s resources. 
It reduces the load on your main databases (akin to not overstaffing every lane) and, as a result, lowers operational costs. This efficient use of resources means your application does more with less, optimizing performance without needing excessive investment in infrastructure.

How Distributed Caching Works

Imagine you're in a large library with lots of books (data). Every time you need a book, you must ask the librarian (the main database), who then searches through the entire library to find it. This process can be slow, especially if many people are asking for books at the same time. Now, enter distributed caching.

Creating a mini-library (cache nodes): In our library, we set up several small bookshelves (cache nodes) around the room. These mini-libraries store copies of the most popular books (frequently accessed data). So, when you want one of these books, you just grab it from the closest bookshelf, which is much faster than waiting for the librarian.
Keeping the mini-libraries updated (cache invalidation): To ensure that the mini-libraries have the latest versions of the books, we have a system. Whenever a new edition comes out, or a book is updated, the librarian makes sure that these changes are reflected in the copies stored on the mini bookshelves. This way, you always get the most current information.
Expanding the library (scalability): As more people come to the library, we can easily add more mini bookshelves or put more copies of popular books on existing shelves. This is like scaling the distributed cache: we can add more cache nodes or increase their capacity, ensuring everyone gets their books quickly, even when the library is crowded.
Always open (high availability): What if one of the mini bookshelves is out of order (a node fails)? Well, there are other mini bookshelves with the same books, so you can still get what you need. This is how distributed caching ensures that data is always available, even if one part of the system goes down.

In essence, distributed caching works by creating multiple quick-access points for frequently needed data, making it much faster to retrieve. It's like having speedy express lanes in a large library, ensuring that you get your book quickly, the library runs efficiently, and everybody leaves happy.

Caching Strategies

Distributed caching strategies are like different methods used in a busy restaurant to ensure customers get their meals quickly and efficiently. Here's how these strategies work in a simplified manner:

Cache-aside (lazy loading): Imagine a waiter who only prepares a dish when a customer orders it. Once cooked, he keeps a copy in the kitchen for any future orders. In caching, this is like loading data into the cache only when it's requested. It ensures that only necessary data is cached, but the first request might be slower as the data is not preloaded. (A minimal code sketch of this pattern appears at the end of this article.)
Write-through caching: This is like a chef who prepares a new dish and immediately stores its recipe in a quick-reference guide. Whenever that dish is ordered, the chef can quickly recreate it using the guide. In caching, data is saved in the cache and the database simultaneously. This method ensures data consistency but might be slower for write operations.
Write-around caching: Consider this as a variation of the write-through method. Here, when a new dish is created, the recipe isn't immediately put into the quick-reference guide. It's added only when it's ordered again.
In caching, data is written directly to the database and only written to the cache if it's requested again. This reduces the cache being filled with infrequently used data but might make the first read slower. Write-back caching: Imagine the chef writes down new recipes in the quick-reference guide first and updates the main recipe book later when there’s more time. In caching, data is first written to the cache and then, after some delay, written to the database. This speeds up write operations but carries a risk if the cache fails before the data is saved to the database. Each of these strategies has its pros and cons, much like different techniques in a restaurant kitchen. The choice depends on what’s more important for the application – speed, data freshness, or consistency. It's all about finding the right balance to serve up the data just the way it's needed! Consistency Models Understanding distributed caching consistency models can be simplified by comparing them to different methods of updating news on various bulletin boards across a college campus. Each bulletin board represents a cache node, and the news is the data you're caching. Strong consistency: This is like having an instant update on all bulletin boards as soon as a new piece of news comes in. Every time you check any board, you're guaranteed to see the latest news. In distributed caching, strong consistency ensures that all nodes show the latest data immediately after it's updated. It's great for accuracy but can be slower because you have to wait for all boards to be updated before continuing. Eventual consistency: Imagine that new news is first posted on the main bulletin board and then, over time, copied to other boards around the campus. If you check a board immediately after an update, you might not see the latest news, but give it a little time, and all boards will show the same information. Eventual consistency in distributed caching means that all nodes will eventually hold the same data, but there might be a short delay. It’s faster but allows for a brief period where different nodes might show slightly outdated information. Weak consistency: This is like having updates made to different bulletin boards at different times without a strict schedule. If you check different boards, you might find varying versions of the news. In weak consistency for distributed caching, there's no guarantee that all nodes will be updated at the same time, or ever fully synchronized. This model is the fastest, as it doesn't wait for updates to propagate to all nodes, but it's less reliable for getting the latest data. Read-through and write-through caching: These methods can be thought of as always checking or updating the main news board (the central database) when getting or posting news. In read-through caching, every time you read data, it checks with the main database to ensure it's up-to-date. In write-through caching, every time you update data, it updates the main database first before the bulletin boards. These methods ensure consistency between the cache and the central database but can be slower due to the constant checks or updates. Each of these models offers a different balance between ensuring data is up-to-date across all nodes and the speed at which data can be accessed or updated. The choice depends on the specific needs and priorities of your application. Use Cases E-Commerce Platforms Normal caching: Imagine a small boutique with a single counter for popular items. 
This helps a bit, as customers can quickly grab what they frequently buy. But when there's a big sale, the counter gets overcrowded, and people wait longer. Distributed caching: Now think of a large department store with multiple counters (nodes) for popular items, scattered throughout. During sales, customers can quickly find what they need from any nearby counter, avoiding long queues. This setup is excellent for handling heavy traffic and large, diverse inventories, typical in e-commerce platforms. Online Gaming Normal caching: It’s like having one scoreboard in a small gaming arcade. Players can quickly see scores, but if too many players join, updating and checking scores becomes slow. Distributed caching: In a large gaming complex with scoreboards (cache nodes) in every section, players anywhere can instantly see updates. This is crucial for online gaming, where real-time data (like player scores or game states) needs fast, consistent updates across the globe. Real-Time Analytics Normal caching: It's similar to having a single newsstand that quickly provides updates on certain topics. It's faster than searching through a library but can get overwhelming during peak news times. Distributed caching: Picture a network of digital screens (cache nodes) across a city, each updating in real-time with news. For applications analyzing live data (like financial trends or social media sentiment), this means getting instant insights from vast, continually updated data sources. Choosing the Right Distributed Caching Solution When selecting a distributed caching solution, consider the following: Performance and latency: Assess the solution's ability to handle your application’s load, especially under peak usage. Consider its read/write speed, latency, and how well it maintains performance consistency. This factor is crucial for applications requiring real-time responsiveness. Scalability and flexibility: Ensure the solution can horizontally scale as your user base and data volume grow. The system should allow for easy addition or removal of nodes with minimal impact on ongoing operations. Scalability is essential for adapting to changing demands. Data consistency and reliability: Choose a consistency model (strong, eventual, etc.) that aligns with your application's needs. Also, consider how the system handles node failures and data replication. Reliable data access and accuracy are vital for maintaining user trust and application integrity. Security features: Given the sensitive nature of data today, ensure the caching solution has robust security features, including authentication, authorization, and data encryption. This is especially important if you're handling personal or sensitive user data. Cost and total ownership: Evaluate the total cost of ownership, including licensing, infrastructure, and maintenance. Open-source solutions might offer cost savings but consider the need for internal expertise. Balancing cost with features and long-term scalability is key for a sustainable solution. Implementing Distributed Caching Implementing distributed caching effectively requires a strategic approach, especially when transitioning from normal (single-node) caching. Here’s a concise guide: Assessment and Planning Normal caching: Typically involves setting up a single cache server, often co-located with the application server. Distributed caching: Start with a thorough assessment of your application’s performance bottlenecks and data access patterns. 
Plan for multiple cache nodes, distributed across different servers or locations, to handle higher loads and ensure redundancy. Choosing the Right Technology Normal caching: Solutions like Redis or Memcached can be sufficient for single-node caching. Distributed caching: Select a distributed caching technology that aligns with your scalability, performance, and consistency needs. Redis Cluster, Apache Ignite, or Hazelcast are popular choices. Configuration and Deployment Normal caching: Configuration is relatively straightforward, focusing mainly on the memory allocation and cache eviction policies. Distributed caching: Requires careful configuration of data partitioning, replication strategies, and node discovery mechanisms. Ensure cache nodes are optimally distributed to balance load and minimize latency. Data Invalidation and Synchronization Normal caching: Less complex, often relying on TTL (time-to-live) settings for data invalidation. Distributed caching: Implement more sophisticated invalidation strategies like write-through or write-behind caching. Ensure synchronization mechanisms are in place for data consistency across nodes. Monitoring and Maintenance Normal caching: Involves standard monitoring of cache hit rates and memory usage. Distributed caching: Requires more advanced monitoring of individual nodes, network latency between nodes, and overall system health. Set up automated scaling and failover processes for high availability. Security Measures Normal caching: Basic security configurations might suffice. Distributed caching: Implement robust security protocols, including encryption in transit and at rest, and access controls. Challenges and Best Practices Challenges Cache invalidation: Ensuring that cached data is updated or invalidated when the underlying data changes. Data synchronization: Keeping data synchronized across multiple cache nodes. Best Practices Regularly monitor cache performance: Use monitoring tools to track hit-and-miss ratios and adjust strategies accordingly. Implement robust cache invalidation mechanisms: Use techniques like time-to-live (TTL) or explicit invalidation. Plan for failover and recovery: Ensure that your caching solution can handle node failures gracefully. Conclusion Distributed caching is an essential component in the architectural landscape of modern applications, especially those requiring high performance and scalability. By understanding the fundamentals, evaluating your needs, and following best practices, you can harness the power of distributed caching to elevate your application's performance, reliability, and user experience. As technology continues to evolve, distributed caching will play an increasingly vital role in managing the growing demands for fast and efficient data access.
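As a concrete footnote to the caching strategies described earlier, below is a minimal cache-aside sketch in Python. It uses the redis-py client against a single cache node and a hypothetical get_user_from_db helper standing in for the primary database lookup; the host name, key format, and TTL are illustrative assumptions, not a production configuration.

Python

import json
import redis

# Connect to a cache node (host and port are placeholders)
cache = redis.Redis(host="cache-node-1", port=6379)

CACHE_TTL_SECONDS = 300  # a time-to-live keeps entries from staying stale forever

def get_user(user_id):
    key = f"user:{user_id}"
    # 1. Try the cache first (the "express lane")
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    # 2. Cache miss: fall back to the primary database (hypothetical helper)
    user = get_user_from_db(user_id)
    # 3. Populate the cache so the next read is fast; the TTL handles invalidation
    cache.set(key, json.dumps(user), ex=CACHE_TTL_SECONDS)
    return user

Write-through and write-back variants differ mainly in when the database write happens relative to the cache write, so the same skeleton can be adapted to those strategies as well.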
Many different roles in the technology world come into contact with data normalization as a routine part of many projects. Developers, database administrators, domain modelers, business stakeholders, and many more move through the normalization process as naturally as breathing. And yet, can something that seems so integral become obsolete? As the database landscape becomes more diverse and hardware becomes more powerful, we might wonder if the practice of data normalization is required anymore. Should we be fretting over optimizing data storage and querying so that we return the minimum amount of data? And if we should, do certain data structures make it more vital to solve those problems than others? In this article, we will review the process of data normalization and evaluate when it is needed, or if it is still a necessary part of digitally storing and retrieving data.

What Is Data Normalization?

Data normalization is the process of optimizing data structures in a relational database to ensure data integrity and query efficiency. It reduces redundancy and improves accuracy by putting the data through a series of steps to normalize the structure (normal forms). At its core, data normalization helps avoid insert, update, and delete data anomalies. These anomalies occur when creating new data, updating existing data, or deleting data, and they cause challenges in keeping data values in sync (integrity). We will talk more about this when we step through the normalization process. The steps entail verifying keys (links to related data), separating unrelated entities into other tables, and inspecting the rows and columns as a unified data object. While the full list of normal form steps is rather rigorous, we will focus on those most commonly applied in business practice: 1st, 2nd, and 3rd normal forms. Other normal forms are mostly used in academics and statistics. Normal form steps must be done in order, and we cannot move to the next normal form until the previous one is complete.

How Do We Do Data Normalization?

Since we have three normal forms to take our data through, we will have three steps to this process. They are as follows:

1st normal form (1NF)
2nd normal form (2NF)
3rd normal form (3NF)

A database professor from college taught my class to memorize the three normal forms as "the key, the whole key, and nothing but the key" (like the courtroom oath for the truth). I had to refresh some of the normal form details for this article, but that basic phrase has always stuck with me. Hopefully, it might help you remember them, too.

I recently came across a coffee shop data set that seems to be a good fit for us to use as an example of normalizing a data set. With a bit of adjustment for our examples here, we can step through the process.

Denormalized Data

transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn
2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N
2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1,2 | Jamaican Coffee River Sm, Oatmeal Scone | 1,1 | 2.45,3.00 | N,N
2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N
2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1,2 | Cappuccino Lg, Jumbo Savory Scone | 2,1 | 4.25,3.75 | N,N

The data contains sales receipts for the company and was originally published on the Kaggle Coffee Shop sample data repository, though I have also created a GitHub repository for today's post.
The data displayed above shows sales made to a customer for ordered products. Why is this data a problem? Earlier, we mentioned normalization to solve insert, update, and delete anomalies. If we try to insert a new row into this data, we could create a duplicate, or worse yet, have to gather all of the information on customers, products, and receipt date/time in order to create it. If we needed to update or delete the products purchased on the receipt, we would need to sort through the list in each product column to search for the value. So let's see how to improve redundancy and integrity by normalizing this data.

1st Normal Form: The Key

For the first step in our "key, the whole key, and nothing but the key," a table should have a primary key (a single column or set of columns) that ensures a row is unique. Each column in a row should also contain only a single value; i.e., no nested tables. Our example data set needs some work to get it to 1NF. While we can get unique rows with a combination of date/time or maybe date/time/customer, it is often much simpler to reference rows with a generated unique value of some sort. Let's do that by adding a transaction_id field to our receipt table. There are also several rows that have more than one item ordered (transaction_id 156 and 199), so a few columns have line items with more than one value. We can correct this by separating the rows with multiple values into separate rows.

1NF Data

transaction_id | transaction_date | transaction_time | instore_yn | customer | loyalty_num | line_item_id | product | quantity | unit_price | promo_item_yn
150 | 2019-04-01 | 12:24:53 | Y | Camille Tyler | 102-192-8157 | 1 | Columbian Medium Roast Sm | 1 | 2.00 | N
156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 1 | Jamaican Coffee River Sm | 1 | 2.45 | N
156 | 2019-04-01 | 12:30:00 | N | Griffith Lindsay | 769-005-9211 | 2 | Oatmeal Scone | 1 | 3.00 | N
165 | 2019-04-01 | 16:44:46 | Y | Stuart Nunez | 796-362-1661 | 1 | Morning Sunrise Chai Rg | 2 | 2.50 | N
199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 1 | Cappuccino Lg | 2 | 4.25 | N
199 | 2019-04-01 | 14:24:55 | Y | Allistair Ramirez | 253-876-9471 | 2 | Jumbo Savory Scone | 1 | 3.75 | N

With this data, a composite (multi-column) key uniquely identifies a row by the combination of transaction_id and line_item_id, as a single receipt cannot contain multiple line item #1s. Here is a look at the data if we just boil the table down to those primary key values.

transaction_id | line_item_id
150 | 1
156 | 1
156 | 2
165 | 1
199 | 1
199 | 2

Each combination of those two values is unique. We have applied the first normal form to our data, but we still have some potential data anomalies. If we wanted to add a new receipt, we might need to create multiple lines (depending on how many line items it contained) and duplicate the transaction ID, date, time, and other information on each row. Updates and deletes cause similar problems because we would need to ensure we get all the rows affected for data to be consistent. This is where the second normal form comes into play.

2nd Normal Form: The Whole Key

The second normal form ensures that each non-key column is fully dependent on the whole key. This is more of a concern for tables with more than one column as the primary key (like our coffee receipt table).
Here is our data again in its first normal form: transaction_id transaction_date transaction_time instore_yn customer loyalty_num line_item_id product quantity unit_price promo_item_yn 150 2019-04-01 12:24:53 Y Camille Tyler 102-192-8157 1 Columbian Medium Roast Sm 1 2.00 N 156 2019-04-01 12:30:00 N Griffith Lindsay 769-005-9211 1 Jamaican Coffee River Sm 1 2.45 N 156 2019-04-01 12:30:00 N Griffith Lindsay 769-005-9211 2 Oatmeal Scone 1 3.00 N 165 2019-04-01 16:44:46 Y Stuart Nunez 796-362-1661 1 Morning Sunrise Chai Rg 2 2.50 N 199 2019-04-01 14:24:55 Y Allistair Ramirez 253-876-9471 1 Cappuccino Lg 2 4.25 N 199 2019-04-01 14:24:55 Y Allistair Ramirez 253-876-9471 2 Jumbo Savory Scone 1 3.75 N We will need to evaluate each non-key field to see if we have any partial dependencies; i.e., the column depends on only part of the key and not the whole key. Since transaction_id and line_item_id make up our primary key, let’s start with the transaction_date field. The transaction date does depend on the transaction ID, as the same transaction ID could not be used again on another day. However, the transaction date doesn’t depend on the line item ID at all. Line items can be reused across transactions, days, and customers even. Ok, so we already found that the table does not follow the second normal form, but let’s check another column. What about the customer column? The customer is not dependent on both the transaction ID and line item ID. If someone gave us a transaction ID, we would know which customer made the purchase, but if we were given a line item ID, we wouldn’t know which single customer that receipt belonged to. After all, multiple customers could have ordered one, two, or six items on their receipts. The customer is linked to the transaction ID (assume multiple customers cannot split receipts), but the customer is not dependent upon the line item. We need to fix these partial dependencies. The most direct solution is to create a separate table for order line items, leaving the columns that are only dependent on transaction_id in the receipt table. The updated data in the second normal form looks like the one below. Receipt transaction_id transaction_date transaction_time instore_yn customer loyalty_num 150 2019-04-01 12:24:53 Y Camille Tyler 102-192-8157 156 2019-04-01 12:30:00 N Griffith Lindsay 769-005-9211 165 2019-04-01 16:44:46 Y Stuart Nunez 796-362-1661 199 2019-04-01 14:24:55 Y Allistair Ramirez 253-876-9471 Receipt Line Item transaction_id line_item_id product_id product quantity unit_price promo_item_yn 150 1 28 Columbian Medium Roast Sm 1 2.00 N 156 1 34 Jamaican Coffee River Sm 1 2.45 N 156 2 77 Oatmeal Scone 1 3.00 N 165 1 54 Morning Sunrise Chai Rg 2 2.50 N 199 1 41 Cappuccino Lg 2 4.25 N 199 2 79 Jumbo Savory Scone 1 3.75 N Now let’s test that our change fixed the issue and follows the second normal form. For our Receipt table, transaction_id becomes the sole primary key. Transaction date is unique based on the transaction_id, as is transaction_time; i.e., there can only be one date and time for a transaction id. Orders cannot be placed both in-store or outside it, so the value of whether a purchase was made in-store or not is dependent upon the transaction_id. Since customers cannot split a receipt, a transaction would also tell us a unique customer. Finally, if someone gave us a transaction ID, we could identify a single customer loyalty number that is attached to it. Next is the Receipt Line Item table. 
Line items are dependent upon the transaction (receipt) with which they are associated, so we retained the transaction ID on our line item table. The combination of transaction_id and line_item_id becomes our composite key on the line item table. Product_id and product are determined based on the transaction and line item together. A single transaction ID wouldn’t tell us which product (if the receipt contains multiple products purchased), and a single line item ID wouldn’t tell us which purchase was being referenced (different receipts could order the same products). This means the product_id and product values are dependent on the whole key. We can also associate a quantity from the transaction_id and line_item_id. Quantities could be the same across receipts or line item IDs, but the combination of both keys gives us a single value for quantity. We also cannot uniquely identify our unit_price or promo_item_yn column values without both the transaction ID and line item ID fields together. Although we have satisfied the second normal form, some data anomalies still exist. If we tried to create a new product for purchase or a new customer, we couldn’t create them in our current tables because we may not have receipts tied to them yet. If we needed to update a product or customer (for typo or name change), we would need to update all the line item rows with those values. If we wanted to delete a product or customer, we couldn’t unless we removed receipts or line items that referenced them. To solve these problems, we can move to the third normal form. 3rd Normal Form: And Nothing but the Key The third normal form ensures that non-key fields are dependent on nothing but the key. In other words, they are not dependent on other non-key fields, causing a transitive dependency. Let’s review our 2NF data again: Receipt transaction_id transaction_date transaction_time instore_yn customer loyalty_num 150 2019-04-01 12:24:53 Y Camille Tyler 102-192-8157 156 2019-04-01 12:30:00 N Griffith Lindsay 769-005-9211 165 2019-04-01 16:44:46 Y Stuart Nunez 796-362-1661 199 2019-04-01 14:24:55 Y Allistair Ramirez 253-876-9471 Receipt Line Item transaction_id line_item_id product_id product quantity unit_price promo_item_yn 150 1 28 Columbian Medium Roast Sm 1 2.00 N 156 1 34 Jamaican Coffee River Sm 1 2.45 N 156 2 77 Oatmeal Scone 1 3.00 N 165 1 54 Morning Sunrise Chai Rg 2 2.50 N 199 1 41 Cappuccino Lg 2 4.25 N 199 2 79 Jumbo Savory Scone 1 3.75 N On our Receipt table, we need to check the non-key fields (everything except transaction_id) to see if the values depend on other non-key fields. The values for transaction date, time, and in-store do not change based on each other or the customer or loyalty number associated, so they are properly dependent on nothing but the key. But what about customer info? The value of the loyalty number could change if the customer changes. For instance, if we needed to delete or update the customer who made the purchase, the loyalty number would also need to be deleted or updated along with it. So, the loyalty number is dependent on the customer, which is a non-key field. This means our Receipt table is not in the third normal form. What about our Receipt Line Item table? Quantity, unit price, and promo item values don’t vary based on the values of each other, nor on the product information, because the three fields state the value of an item at the time of purchase. 
However, the product is dependent on the product_id because the value would change based on which product ID was referenced. So this table also needs some updates to comply with the third normal form. Again, the best method to solve these issues is to pull the related columns to separate tables and leave a reference ID (foreign key) to link the original tables with the new ones. We eliminate data anomalies on insert, update, and delete, as well as reduce data redundancy and improve efficiency for storage and querying.

Receipt

| transaction_id | transaction_date | transaction_time | instore_yn | customer_id |
|---|---|---|---|---|
| 150 | 2019-04-01 | 12:24:53 | Y | 604 |
| 156 | 2019-04-01 | 12:30:00 | N | 32 |
| 165 | 2019-04-01 | 16:44:46 | Y | 127 |
| 199 | 2019-04-01 | 14:24:55 | Y | 112 |

Receipt Line Item

| transaction_id | line_item_id | product_id | quantity | unit_price | promo_item_yn |
|---|---|---|---|---|---|
| 150 | 1 | 28 | 1 | 2.00 | N |
| 156 | 1 | 34 | 1 | 2.45 | N |
| 156 | 2 | 77 | 1 | 3.00 | N |
| 165 | 1 | 54 | 2 | 2.50 | N |
| 199 | 1 | 41 | 2 | 4.25 | N |
| 199 | 2 | 79 | 1 | 3.75 | N |

Product

| product_id | product |
|---|---|
| 28 | Columbian Medium Roast Sm |
| 34 | Jamaican Coffee River Sm |
| 77 | Oatmeal Scone |
| 54 | Morning Sunrise Chai Rg |
| 41 | Cappuccino Lg |
| 79 | Jumbo Savory Scone |

Customer

| customer_id | customer | loyalty_num |
|---|---|---|
| 604 | Camille Tyler | 102-192-8157 |
| 32 | Griffith Lindsay | 769-005-9211 |
| 127 | Stuart Nunez | 796-362-1661 |
| 112 | Allistair Ramirez | 253-876-9471 |
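A minimal DDL sketch of the final 3NF design (again, column types are illustrative assumptions, and it refines the earlier 2NF sketch) makes the remaining dependencies explicit through primary and foreign keys:

SQL
-- Sketch of the 3NF design; data types are illustrative assumptions.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    customer    VARCHAR(255) NOT NULL,
    loyalty_num VARCHAR(20)
);

CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    product    VARCHAR(255) NOT NULL
);

CREATE TABLE receipt (
    transaction_id   INTEGER PRIMARY KEY,
    transaction_date DATE    NOT NULL,
    transaction_time TIME    NOT NULL,
    instore_yn       CHAR(1) NOT NULL,
    customer_id      INTEGER NOT NULL REFERENCES customer (customer_id)
);

CREATE TABLE receipt_line_item (
    transaction_id INTEGER      NOT NULL REFERENCES receipt (transaction_id),
    line_item_id   INTEGER      NOT NULL,
    product_id     INTEGER      NOT NULL REFERENCES product (product_id),
    quantity       INTEGER      NOT NULL,
    unit_price     NUMERIC(6,2) NOT NULL,
    promo_item_yn  CHAR(1)      NOT NULL,
    PRIMARY KEY (transaction_id, line_item_id)
);

With this schema, a product or customer can be inserted, updated, or deleted independently of any receipt, which resolves the anomalies described earlier.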
Data Normalization Outside Relational Databases

So does this process of data normalization make sense outside of relational databases? Is it needed for document, columnar, key-value, and/or graph databases? From my perspective, the goals of data normalization - reducing redundancy, improving data integrity, and increasing query performance - are still highly valuable no matter which database you are working with. However, the process and rules for the normal forms in relational data normalization likely are not a one-for-one match with other data models. Let's see some examples using our memory key of "the key, the whole key, and nothing but the key" for three main categories of databases: relational, document, and graph.

Relational databases are optimized for assembling data into various meaningful sets by joining tables in SQL queries. We already stepped through the normalization process from this perspective, so the benefits for redundancy, efficiency, and data integrity hopefully are clear from our earlier discussion.

In document databases, the model is optimized for grouping related information together in a single document, so that looking up a single customer retrieves all receipts and any other details. This is already in conflict with our data redundancy goal because we could potentially duplicate product information, or allow inconsistencies in it, in order to store that data with a customer. Primary keys for documents that will be queried still make sense to avoid multiple lookups, but additional normalization steps may or may not conflict with the goals of the database model itself.

Graph databases balance the data integrity provided by relational databases with the pre-baked relationship data provided by documents, creating a unique model optimized for assembling data relationships without adding data redundancy. Unique entities via a primary key are still important to improve query and storage efficiency, but relationships are stored as separate entities, naturally teasing related data apart into separate entities without analyzing each field for partial or non-key dependencies. Normalization exists here, but it feels more organic and less process-driven.

Wrap Up

In summary, we covered the process of data normalization as it pertains to the traditional relational database world. We discussed each of the three normal forms and applied them to a coffee shop receipt data set. Finally, we looked at what data normalization looks like in other types of databases (document and graph) and which of its forms make sense based on the structure of the database model.
Have you ever wondered how data flows within a software system? How is information processed and transformed, and how does it deliver value? Data Flow Diagrams (DFDs) are a "visual language" that may answer such questions. An important tool for understanding how data moves in a software system, DFDs provide a visual representation of the flow of data from its entry point to its final destination and highlight data transformations along the way. Whether you're a tester, a seasoned developer, a budding programmer, or a stakeholder involved in system design and architecture, understanding DFDs unlocks a valuable skillset. This article provides fundamental knowledge about DFDs, highlighting their benefits and guiding you on how to leverage them effectively. We start with DFD basics and a set of steps on how to create a DFD. An inventory management system is used as an example, where we design a sample of basic test cases based on DFDs. The benefits and limitations of DFDs are also explored. We finish by providing tools available for DFDs, highlighting their strengths and weaknesses. The Orchestra A useful metaphor is that of the orchestra, where data flows like musical notes played by different instruments. DFDs act as the conductor's score, outlining the movement and transformation of these notes. Here's how the orchestra translates to DFD components: Musicians (data sources): These are the violins, flutes, and other instruments that provide the initial musical data (e.g., customer records from a database, sensor readings from a device). Audience (data destinations): The audience represents the final recipients of the music, just like reports generated for management or data sent to another system for further processing. Sheet music stands (data stores): The music stands holding the sheet music – these are like databases, files, or in-memory buffers that temporarily or permanently store data. Musical phrases (data flows): The flowing melodies and harmonies translate to data flows, depicted as arrows connecting different components. The conductor (data processes): Just as the conductor guides the musicians and shapes the music, data processes represent transformations happening to the data as it flows (e.g., calculations, filtering, data manipulation). By analyzing the score (DFD), we can understand the complete musical journey. Similarly, understanding data flow helps us comprehend how information moves and evolves within a system. How To Create a DFD In short, creating a DFD is an iterative process that could benefit from pairing activities and feedback from others. We start by defining scope, entities, and other parameters. We are done when we can answer some basic questions as mentioned below. 1. Define the System Scope The first step is to clearly define the boundaries of the system you're modeling. What are the system's functionalities? What are the external entities it interacts with? Having a well-defined scope ensures your DFD focuses on the relevant data flows within the system. 2. Identify External Entities List all the external entities that interact with the system. Are there users entering data through a web interface? Does the system receive data from another system via an API? Each entity should be named clearly, reflecting its role in data exchange. 3. Pinpoint Data Flows For each external entity, identify the data it sends to the system (inputs) and the data it receives from the system (outputs). 
Label the data flows with descriptive names that capture the essence of the data being transferred. 4. Introduce Data Stores Identify the data repositories within the system. What kind of data does the system store? Does it maintain a database of customer information or a log file of system activity? Each data store should be named appropriately, reflecting the data it holds. 5. Outline Data Processes Define the transformations that occur on the data as it flows through the system. How is the customer order data processed to generate an invoice? How is the sensor data filtered before analysis? Each data process should be labeled with a clear description of its function. 6. Diagramming the DFD Once you have identified all the elements, it's time to visually represent them using a DFD tool or even a simple drawing tool. Use standard symbols for external entities (rectangles), data flows (arrows), data stores (cylinders), and data processes (rounded rectangles). 7. Leveling Up: Context and DFD Levels A single DFD might not capture the entire system's complexity. Here's where the concept of DFD levels comes in: Context Diagram (Level 0): This high-level overview depicts the system as a single process interacting with external entities. Level 1 DFD: This level decomposes the single process from the context diagram into more detailed sub-processes, showcasing data flows and data stores within the system. Level 2 DFD (and beyond): Further decomposition can occur, focusing on specific sub-processes from Level 1 DFDs, providing even greater detail on data flow within those sub-processes. 8. Refining and Validating Once your DFD is drafted, review it for accuracy and completeness. Do the data flows connect to the correct entities and processes? Are the data transformations within processes clearly defined? Seek feedback from stakeholders to ensure the DFD accurately reflects the system's intended behavior. How to Level Up To help grasp the main idea here, another metaphor could help. Each city has neighborhoods, streets, and hidden alleyways. A single map might not capture every detail. Similarly, a single DFD might not encapsulate the detailed data flow within a complex system. This is where DFD levels come into play, offering a hierarchical approach to visualize data flow at different granularities. 1. Context Diagram (Level 0) The big picture: This is a city map from a helicopter. It depicts the entire system as a single, high-level process. This process interacts with various external entities, represented by rectangles. Data flows (arrows) showcase the exchange of information between the system and these entities. Focus: The context diagram provides a broad overview, highlighting the system's purpose and its interaction with the external world. It's ideal for high-level discussions and initial system understanding. 2. Level 1 DFD Delving deeper: Now we descend into the city! This level decomposes the single process from the context diagram into more detailed sub-processes. These sub-processes, represented by rounded rectangles, showcase the internal workings of the system. Data flow and stores: Level 1 DFDs could depict data flows (arrows) connecting these sub-processes. They also introduce data stores (cylinders) representing the system's internal repositories where data is temporarily or permanently held (e.g., databases, files). Increased detail: This level offers a more granular view of how data flows within the system. 
It reveals more about the functionalities performed by each sub-process and how they interact with data stores.

3. Level 2 DFD (and Beyond)

Zooming in on specific areas: Here, we are exploring a specific neighborhood within the city. Level 2 DFDs (and potentially even further levels) take a sub-process from the Level 1 DFD and break it down even further. They focus on the data flow within that specific sub-process.

Greater clarity for complex functions: This level is particularly useful for complex functionalities within the system. By decomposing them into smaller, more manageable components, DFDs provide a clearer understanding of how data is manipulated and transformed within each sub-process.

Choosing the Right Level

The appropriate DFD level depends on the system's complexity and the level of detail required.

Context diagram: Ideal for initial system understanding and high-level communication
Level 1 DFD: Provides a good balance between overview and detail, useful for design and development discussions
Level 2 DFD (and beyond): Focuses on specific functionalities, helpful for in-depth analysis and documentation

By leveraging DFD levels, you can create a comprehensive set of diagrams that effectively capture the data flow within a system, from a high-level overview to a granular examination of specific processes. This layered approach ensures clear communication and understanding for all stakeholders involved in system design and development.

Inventory Management System

Let's assume that our software manages the inventory for a small store. Here's a simplified breakdown of its functionalities:

User adds new items: The user interacts with the system to add new items to the inventory. This involves providing details like item name, description, quantity, and price.
Inventory data validation: The system validates the entered data, ensuring required fields are filled and data formats are correct (e.g., positive quantities, valid price format).
Inventory update: If validation passes, the new item is added to the inventory database, or an existing item's quantity is updated.
Item search: The user can search for items in the inventory by name or other criteria.
Inventory report: The user can generate a report summarizing the current inventory status, including item names, quantities, and total values.

A context diagram for the inventory management system could simply depict that a user interacts with the system by exchanging inventory data, as shown below. A Level 1 DFD may decompose the context diagram, as shown below. It focuses on core functionalities. You can extend it to include additional data flows, such as user authentication or managing low-stock alerts, among others. A Level 2 DFD may focus on and expand any item from the Level 1 DFD. As an example, we will create a Level 2 DFD by focusing on the "Add New Item" functionality from the Level 1 DFD. The focus is on data validation for new items. You can extend it to include data processing for adding the item to the database, for example. The specific validation rules can be further customized based on your system's requirements (e.g., validating price range or description length). Level 2 DFDs can be created for other functionalities like "Item Search" or "Inventory Report" to provide a more granular view of data flow within those processes.

Test Case Design Based on DFD

Now, let's leverage the DFD to design test cases that cover various scenarios and potential data flows.
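As a side note, some of the validation rules surfaced by this Level 2 DFD (required fields, positive quantities, valid prices, a length limit on names) could also be enforced in the inventory data store itself. The sketch below is only illustrative; the table and column names are assumptions, not part of the example system:

SQL
-- Hypothetical inventory data store derived from the DFD's data store.
-- Constraints mirror the validation rules described above.
CREATE TABLE inventory_item (
    item_id     SERIAL PRIMARY KEY,
    name        VARCHAR(100)  NOT NULL UNIQUE,                -- required; duplicates rejected (Test Case 2.1), length limited (Test Case 2.2)
    description VARCHAR(500),
    quantity    INTEGER       NOT NULL CHECK (quantity >= 0), -- no negative quantities (Test Case 1.3)
    price       NUMERIC(10,2) NOT NULL CHECK (price > 0)      -- positive, numeric price (Test Case 1.4)
);

The test cases that follow exercise these same rules from the user's point of view.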
1. User Interface Testing

Test Case 1.1: Enter valid item details (name, description, quantity > 0, price > 0). Expected result: Item is successfully added to the inventory.
Test Case 1.2: Leave a field blank (e.g., no item name). Expected result: The system displays an error message prompting the user to fill in the required field.
Test Case 1.3: Enter an invalid quantity (e.g., negative number). Expected result: The system displays an error message indicating an invalid quantity format.
Test Case 1.4: Enter an invalid price format (e.g., letters instead of numbers). Expected result: The system displays an error message indicating an invalid price format.

2. Inventory Data Validation Testing

Test Case 2.1: Enter a duplicate item name for a new item. Expected result: The system displays an error message indicating that the item already exists.
Test Case 2.2: Enter a very long item name (exceeding a defined limit). Expected result: The system displays an error message indicating that the item name is too long.

3. Inventory Update Testing

Test Case 3.1: Add a new item with a valid quantity. Expected result: The item is added to the inventory database with the correct quantity.
Test Case 3.2: Update the quantity of an existing item. Expected result: The inventory database is updated with the new quantity for the item.
Test Case 3.3: Attempt to update the quantity of a non-existent item. Expected result: The system displays an error message indicating that the item cannot be found.

4. Item Search Testing

Test Case 4.1: Search for an item by its exact name (case-sensitive). Expected result: The system accurately retrieves the item information.
Test Case 4.2: Search for an item by a partial name match (case-insensitive). Expected result: The system retrieves all items that match the partial name (if applicable).
Test Case 4.3: Search for an item that doesn't exist in the inventory. Expected result: The system displays a message indicating that no items match the search criteria.

5. Inventory Report Testing

Test Case 5.1: Generate a report when the inventory is empty. Expected result: The report displays a message indicating no items in the inventory.
Test Case 5.2: Generate a report with various items in the inventory. Expected result: The report accurately lists all items with their names, quantities, and calculated total values.
Test Case 5.3: Generate a report in a specific format (e.g., CSV, PDF). Expected result: The report is generated in the requested format with correct data representation.

Remember: This is a simplified example, and the specific test cases will vary depending on your functionality. DFDs are a valuable tool for identifying key data flows and system components. By analyzing these flows, you can design test cases to ensure the system functions as expected.

Benefits of Data Flow Diagrams

Creating DFDs offers a multitude of benefits for software development projects:

Enhanced Communication: DFDs provide a clear "visual language" that stakeholders, both technical and non-technical, can understand. This can improve communication and collaboration during the design phase.

Improved System Design: By visualizing data flow, potential bottlenecks or inefficiencies in data processing can be identified early on. This allows for a more optimized system design.

Clearer Data Requirements: DFDs highlight the data required by the system to function effectively. This can help to define data storage needs and design appropriate database structures.
Documentation and Maintenance DFDs serve as valuable documentation throughout the development lifecycle. They provide a reference point for project managers, architects, developers, testers, and future maintainers, ensuring a clear understanding of the system's data flow. Early Error Detection By visualizing data flow, potential inconsistencies or missing data transformations can be identified before coding commences. This leads to fewer errors during development and reduces the need for costly rework later in the project. Limitations of Data Flow Diagrams Here are some of the limitations of DFDs: Limited Control Flow Representation DFDs primarily focus on data flow and transformations. They don't explicitly depict the order or decision logic within processes. While some tools might offer symbols for conditional flows, DFDs aren't ideal for representing complex control flow logic. This can be crucial for understanding how a system behaves under different conditions. Data Structure Complexity DFDs represent data flows with simple labels, which might not adequately capture the structure and complexity of data. For systems dealing with complex data objects or hierarchical relationships, this may be an issue. Scalability for Large Systems For very large and complex systems, creating and managing a single, comprehensive DFD can become cumbersome. The sheer number of elements and data flows can make the diagram difficult to understand and maintain. Focus on Functionality, Not Implementation DFDs primarily depict the "what" of a system, focusing on functionalities and data flow. They don't directly translate to specific code or implementation details. Additional documentation might be needed to bridge the gap between the DFD and the actual system implementation. Limited User Interaction Modeling While DFDs can represent basic user interactions with the system as external entities, they don't depict the user interface or user experience (UX) aspects in detail. Additional tools or diagrams might be required to capture these aspects effectively. In spite of these limitations, DFDs remain a valuable tool for system design and communication. Here are some strategies to mitigate these limitations: Use swimlane diagrams for complex control flow: These diagrams can visually represent decision points and alternative paths within a process. Create separate DFDs for different functionalities: Breaking down a large system into smaller, more manageable DFDs can improve readability and maintainability. Combine DFDs with other tools: Use DFDs in conjunction with other diagrams like structure charts or user flow diagrams to provide a more comprehensive picture of the system. Document data structures separately: Create additional documentation that details the structure and relationships within your data objects. By understanding these limitations and employing appropriate strategies, you can leverage DFDs effectively to design, document, and communicate system functionalities. Tools for Data Flow Diagrams Here are the most popular tools for creating DFDs, along with a comparison of their key features: 1. 
Lucidchart Strengths Cloud-based: Accessible from any device with a web browser, good for remote collaboration User-friendly: Drag-and-drop functionality simplifies DFD creation Real-time collaboration: Enables multiple users to work on the same DFD simultaneously Integration: Connects seamlessly with various project management and development tools Rich template library: Offers a comprehensive library of DFD symbols and templates Weaknesses Cost: The free plan has limited features and storage space. Advanced features require paid subscriptions. 2. Microsoft Visio Strengths Industry-standard: Widely recognized and used across various industries Extensive library: Provides a vast collection of DFD symbols and templates for detailed diagrams Customization options: Allows extensive customization of shapes, lines, and styles Integration: Offers strong integration with other Microsoft Office products Weaknesses Cost: Can be expensive, especially for single users Learning curve: Steeper learning curve compared to simpler tools Overkill for basic DFDs: There might be more than needed for creating simple DFDs. 3. Draw.io (Formerly Gliffy) Strengths Free and open-source: No licensing costs and accessible to everyone Cross-platform: Available as a web-based interface and a desktop app Large symbol library: Offers a wide range of shapes and templates, including DFD symbols Export options: Allows exporting diagrams in various image formats (PNG, JPG) and SVG for further editing Weaknesses Limited collaboration: Collaboration features are less robust compared to some paid tools. Fewer advanced features: Lacks some advanced features like shape customization or data import/export 4. yEd Graph Editor Strengths Free and open-source: Another free option for DFD creation Flexibility: Offers flexibility for creating custom shapes and layouts Data import/export: Supports importing/exporting data in various formats, useful for complex diagrams Weaknesses Learning curve: The user interface might be less intuitive compared to drag-and-drop tools. Collaboration: Lacks some of the collaborative features of cloud-based tools 5. Microsoft Word Strengths Readily available: Most users already have access to Microsoft Word, making it a convenient option. Basic functionalities: Offers basic shape drawing capabilities and limited DFD symbol options Documentation: This can be sufficient for creating simple DFDs for documentation purposes within Word documents. Weaknesses Limited capabilities: Not a dedicated diagramming tool, making it cumbersome for complex DFDs Missing features: Lacks advanced features like shape customization, layout options, and robust collaboration Choosing the Right Tool The best tool for you depends on your specific needs. Here's a quick guide: For simple DFDs with limited collaboration needs: Draw.io or Word might be sufficient. For complex DFDs and collaboration: Lucidchart or Visio are good choices with advanced features. For budget-conscious users: Draw.io and yEd Graph Editor are free alternatives. For Microsoft Office users who need basic DFD creation: Microsoft Word can suffice for simple diagrams. Consider the factors mentioned above and try these tools to see which one best suits your workflow and project requirements. Wrapping Up Data Flow Diagrams are a cornerstone of effective software development. By mastering DFD creation, you gain a powerful tool for understanding, visualizing, and documenting the flow of data within a system. 
This empowers you to design efficient data processing workflows, identify potential issues early on, and communicate clearly with stakeholders throughout the development process.
The modern data stack represents the evolution of data management, shifting from traditional, monolithic systems to agile, cloud-based architectures. It's designed to handle large amounts of data, providing scalability, flexibility, and real-time processing capabilities. This stack is modular, allowing organizations to use specialized tools for each function: data ingestion, storage, transformation, and analysis, facilitating a more efficient and democratized approach to data analytics and business operations. As businesses continue to prioritize data-driven decision-making, the modern data stack has become integral to unlocking actionable insights and fostering innovation. The Evolution of Modern Data Stack The Early Days: Pre-2000s Companies use big, single systems to keep and manage their data. These were good for everyday business tasks but not so much for analyzing lots of data. Data was stored in traditional relational databases like Oracle, IBM DB2, and Microsoft SQL Server. The Big Data Era: Early 2000s - 2010s This period marked the beginning of a shift towards systems that could handle massive amounts of data at high speeds and in various formats. We started to see a lot more data from all over, and it was coming in fast. New tech like Hadoop helped by spreading out the data work across many computers. The Rise of Cloud Data Warehouses: Mid-2010s Cloud computing started to revolutionize data storage and processing. Cloud data warehouses like Amazon Redshift and Google BigQuery offered scalability and flexibility, changing the economics and speed of data analytics. Also, Snowflake, a cloud-based data warehousing startup, emerged, offering a unique architecture separating computing and storage. The Modern Data Stack: Late 2010s - Present The modern data stack took shape with the rise of ELT processes, SaaS-based data integration tools, and the separation of storage and compute. This era saw the proliferation of tools designed for specific parts of the data lifecycle, enabling a more modular and efficient approach to data management. Limitations of Traditional Data Systems In my data engineering career, across several organizations, I've extensively worked with Microsoft SQL Server. This section will draw from those experiences, providing a personal touch as I recount the challenges faced with this traditional system. Later, we'll explore how the Modern Data Stack (MDS) addresses many of these issues; some solutions were quite a revelation to me! Scalability Traditional SQL Server deployments were often hosted on-premises, which meant that scaling up to accommodate growing data volumes required significant hardware investments and could lead to extended downtime during upgrades. What's more, when we had less data to deal with, we still had all these extra hardware that we didn't really need. But we were still paying for them. It was like paying for a whole bus when you only need a few seats. Complex ETL SSIS was broadly used for ETL; while it is a powerful tool, it had certain limitations, especially when compared to more modern data integration solutions. Notably, Microsoft SQL Server solved a lot of these limitations in Azure Data Factory and SQL Server Data Tools (SSDT). API calls: SSIS initially lacked direct support for API calls. Custom scripting was required to interact with web services, complicating ETL processes. Memory allocation: SSIS jobs needed careful memory management. Without enough server memory, complex data jobs could fail. 
Auditing: Extensive auditing within SSIS packages was necessary to monitor and troubleshoot, adding to the workload. Version control: Early versions of SSIS presented challenges with version control integration, complicating change tracking and team collaboration. Cross-platform accessibility: Managing SSIS from non-Windows systems was difficult, as it was a Windows-centric tool. Maintenance Demands The maintenance of on-premises servers was resource-intensive. I recall the significant effort required to ensure systems were up-to-date and running smoothly, often involving downtime that had to be carefully managed. Integration Integrating SQL Server with newer tools and platforms was not always straightforward. It sometimes required creative workarounds, which added to the complexity of our data architecture. How the Modern Data Stack Solved My Data Challenges The Modern Data Stack (MDS) fixed a lot of the old problems I had with SQL Server. Now, we can use the cloud to store data, which means no more spending on big, expensive servers we might not always need. Getting data from different places is easier because there are tools that do it all for us, and there is no more tricky coding. When it comes to sorting and cleaning up our data, we can do it straight into the database with simple commands. This avoids the headaches of managing big servers or digging through tons of data to find a tiny mistake. And when we talk about keeping our data safe and organized, the MDS has tools that make this super easy and way less of a chore. So with the MDS, we're saving time, we can move quicker, and it's a lot less hassle all around. It's like having a bunch of smart helpers who take care of the tough stuff so we can focus on the cool part—finding out what the data tells us. Components of the Modern Data Stack MDS is made up of various layers, each with specialized tools that work together to streamline data processes. Data Ingestion and Integration The extraction and loading of data from diverse sources, including APIs, databases, and SaaS applications. Ingestion tools fivetran, stitch, airbyte, segment, etc. Data Storage Modern cloud data warehouses and data lakes offer scalable, flexible, and cost-effective storage solutions. Cloud Data Warehouses Google Bigquery, Snowflake, Redshift, etc. Data Transformation Tools like dbt (data build tool) enable transformation within the data warehouse using simple SQL, improving upon traditional ETL processes. Data Analysis and Business Intelligence The analytics and Business Intelligence tools allow for advanced data exploration, visualization, and sharing of insights across the organization. Business Intelligence Tools Tableau, Looker, Power BI, Good Data Data Extraction and Reverse ETL Enables organizations to operationalize their warehouse data by moving it back into business applications, driving action from insights. Reverse ETL tools Hightouch, Census Data Orchestration Platforms that help automate and manage data workflows, ensuring that the right data is processed at the right time. Orchestration Tools Airflow, Astronomer, Dagster, AWS Step Functions Data Governance and Security Data governance focuses on the importance of managing data access, ensuring compliance, and protecting data within the MDS. Data Governance also provides comprehensive management of data access, quality, and compliance while offering an organized inventory of data assets that enhances discoverability and trustworthiness. 
Data Catalog Tools: Alation (for data cataloging), Collibra (for governance and cataloging), Apache Atlas

Data Quality: Ensures data reliability and accuracy through validation and cleaning, providing confidence in data-driven decision-making.

Data Quality Tools: Talend, Monte Carlo, Soda, Anomalo, Great Expectations

Data Modeling: Assists in designing and iterating database schemas easily, supporting agile and responsive data architecture practices.

Modeling Tools: Erwin, SQLDBM

Conclusion: Embracing MDS With Cost Awareness

The Modern Data Stack is pretty amazing; it's like having a Swiss army knife for handling data. It definitely makes things faster and less of a headache. But while it's super powerful and gives us a lot of cool tools, it's also important to keep an eye on the price tag. The pay-as-you-go pricing of the cloud is great because we only pay for what we use. But, just like a phone bill, if we're not careful, those little things can add up. So, while we enjoy the awesome features of the MDS, we should also make sure to stay smart about how we use them. That way, we can keep saving time without any surprises when it comes to costs.
Software engineers occupy an exciting place in this world. Regardless of the tech stack or industry, we are tasked with solving problems that directly contribute to the goals and objectives of our employers. As a bonus, we get to use technology to mitigate any challenges that come into our crosshairs. For this example, I wanted to focus on how pgvector — an open-source vector similarity search for Postgres — can be used to identify data similarities that exist in enterprise data. A Simple Use Case As a simple example, let’s assume the marketing department requires assistance for a campaign they plan to launch. The goal is to reach out to all the Salesforce accounts that are in industries that closely align with the software industry. In the end, they would like to focus on accounts in the top three most similar industries, with the ability to use this tool in the future to find similarities for other industries. If possible, they would like the option to provide the desired number of matching industries, rather than always returning the top three. High-Level Design This use case centers around performing a similarity search. While it is possible to complete this exercise manually, the Wikipedia2Vec tool comes to mind because of the pre-trained embeddings that have already been created for multiple languages. Word embeddings — also known as vectors — are numeric representations of words that contain both their syntactic and semantic information. By representing words as vectors, we can mathematically determine which words are semantically “closer” to others. In our example, we could also have written a simple Python program to create word vectors for each industry configured in Salesforce. The pgvector extension requires a Postgres database. However, the enterprise data for our example currently resides in Salesforce. Fortunately, Heroku Connect provides an easy way to sync the Salesforce accounts with Heroku Postgres, storing it in a table called salesforce.account. Then, we’ll have another table called salesforce.industries that contains each industry in Salesforce (as a VARCHAR key), along with its associated word vector. With the Salesforce data and word vectors in Postgres, we’ll create a RESTful API using Java and Spring Boot. This service will perform the necessary query, returning the results in the JSON format. We can illustrate the high-level view of the solution like this: The source code will reside in GitLab. Issuing a git push heroku command will trigger a deployment in Heroku, introducing a RESTful API that the marketing team can easily consume. Building the Solution With the high-level design in place, we can start building a solution. Using my Salesforce login, I was able to navigate to the Accounts screen to view the data for this exercise. Here is an example of the first page of enterprise data: Create a Heroku App For this effort, I planned to use Heroku to solve the marketing team’s request. I logged into my Heroku account and used the Create New App button to establish a new application called similarity-search-sfdc: After creating the app, I navigated to the Resources tab to find the Heroku Postgres add-on. I typed “Postgres” into the add-ons search field. After selecting Heroku Postgres from the list, I chose the Standard 0 plan, but pgvector is available on Standard-tier (or higher) database offerings running PostgreSQL 15 or the beta Essential-tier database. When I confirmed the add-on, Heroku generated and provided a DATABASE_URL connection string. 
I found this in the Config Vars section of the Settings tab of my app. I used this information to connect to my database and enable the pgvector extension like this: Shell CREATE EXTENSION vector; Next, I searched for and found the Heroku Connect add-on. I knew this would give me an easy way to connect to the enterprise data in Salesforce. For this exercise, the free Demo Edition plan works just fine. At this point, the Resources tab for the similarity-search-sfdc app looked like this: I followed the “Setting Up Heroku Connect” instructions to link my Salesforce account to Heroku Connect. Then, I selected the Account object for synchronization. Once completed, I was able to see the same Salesforce account data in Heroku Connect and in the underlying Postgres database. From a SQL perspective, what I did resulted in the creation of a salesforce.account table with the following design: SQL create table salesforce.account ( createddate timestamp, isdeleted boolean, name varchar(255), systemmodstamp timestamp, accountnumber varchar(40), industry varchar(255), sfid varchar(18), id serial primary key, _hc_lastop varchar(32), _hc_err text ); Generate Vectors In order for the similarity search to function as expected, I needed to generate word vectors for each Salesforce account industry: Apparel Banking Biotechnology Construction Education Electronics Engineering Entertainment Food & Beverage Finance Government Healthcare Hospitality Insurance Media Not For Profit Other Recreation Retail Shipping Technology Telecommunications Transportation Utilities Since the primary use case indicated the need to find similarities for the software industry, we would need to generate a word vector for that industry too. To keep things simple for this exercise, I manually executed this task using Python 3.9 and a file called embed.py, which looks like this: Python from wikipedia2vec import Wikipedia2Vec wiki2vec = Wikipedia2Vec.load('enwiki_20180420_100d.pkl') print(wiki2vec.get_word_vector('software').tolist()) Please note – the get_word_vector() method expects a lowercase representation of the industry. 
Running python embed.py generated the following word vector for the software word: Shell [-0.40402618050575256, 0.5711150765419006, -0.7885153293609619, -0.15960034728050232, -0.5692323446273804, 0.005377458408474922, -0.1315757781267166, -0.16840921342372894, 0.6626015305519104, -0.26056772470474243, 0.3681095242500305, -0.453583300113678, 0.004738557618111372, -0.4111144244670868, -0.1817493587732315, -0.9268549680709839, 0.07973367720842361, -0.17835664749145508, -0.2949991524219513, -0.5533796548843384, 0.04348105192184448, -0.028855713084340096, -0.13867013156414032, -0.6649054884910583, 0.03129105269908905, -0.24817068874835968, 0.05968991294503212, -0.24743635952472687, 0.20582349598407745, 0.6240783929824829, 0.3214546740055084, -0.14210252463817596, 0.3178422152996063, 0.7693028450012207, 0.2426985204219818, -0.6515568494796753, -0.2868216037750244, 0.3189859390258789, 0.5168254971504211, 0.11008890718221664, 0.3537853956222534, -0.713259220123291, -0.4132286608219147, -0.026366405189037323, 0.003034653142094612, -0.5275223851203918, -0.018167126923799515, 0.23878540098667145, -0.6077089905738831, 0.5368344187736511, -0.1210874393582344, 0.26415619254112244, -0.3066694438457489, 0.1471938043832779, 0.04954215884208679, 0.2045321762561798, 0.1391817331314087, 0.5286830067634583, 0.5764685273170471, 0.1882934868335724, -0.30167853832244873, -0.2122340053319931, -0.45651525259017944, -0.016777794808149338, 0.45624101161956787, -0.0438646525144577, -0.992512047290802, -0.3771328926086426, 0.04916151612997055, -0.5830298066139221, -0.01255014631897211, 0.21600870788097382, -0.18419665098190308, 0.1754663586616516, -0.1499166339635849, -0.1916201263666153, -0.22884036600589752, 0.17280352115631104, 0.25274306535720825, 0.3511175513267517, -0.20270302891731262, -0.6383468508720398, 0.43260180950164795, -0.21136239171028137, -0.05920517444610596, 0.7145522832870483, 0.7626600861549377, -0.5473887920379639, 0.4523043632507324, -0.1723199188709259, -0.10209759324789047, -0.5577948093414307, -0.10156919807195663, 0.31126976013183594, 0.3604489266872406, -0.13295558094978333, 0.2473849356174469, 0.278846800327301, -0.28618067502975464, 0.00527254119515419] Create Table for Industries In order to store the word vectors, we needed to add an industries table to the Postgres database using the following SQL command: SQL create table salesforce.industries ( name varchar not null constraint industries_pk primary key, embeddings vector(100) not null ); With the industries table created, we’ll insert each of the generated word vectors. 
We do this with SQL statements similar to the following: SQL INSERT INTO salesforce.industries (name, embeddings) VALUES ('Software','[-0.40402618050575256, 0.5711150765419006, -0.7885153293609619, -0.15960034728050232, -0.5692323446273804, 0.005377458408474922, -0.1315757781267166, -0.16840921342372894, 0.6626015305519104, -0.26056772470474243, 0.3681095242500305, -0.453583300113678, 0.004738557618111372, -0.4111144244670868, -0.1817493587732315, -0.9268549680709839, 0.07973367720842361, -0.17835664749145508, -0.2949991524219513, -0.5533796548843384, 0.04348105192184448, -0.028855713084340096, -0.13867013156414032, -0.6649054884910583, 0.03129105269908905, -0.24817068874835968, 0.05968991294503212, -0.24743635952472687, 0.20582349598407745, 0.6240783929824829, 0.3214546740055084, -0.14210252463817596, 0.3178422152996063, 0.7693028450012207, 0.2426985204219818, -0.6515568494796753, -0.2868216037750244, 0.3189859390258789, 0.5168254971504211, 0.11008890718221664, 0.3537853956222534, -0.713259220123291, -0.4132286608219147, -0.026366405189037323, 0.003034653142094612, -0.5275223851203918, -0.018167126923799515, 0.23878540098667145, -0.6077089905738831, 0.5368344187736511, -0.1210874393582344, 0.26415619254112244, -0.3066694438457489, 0.1471938043832779, 0.04954215884208679, 0.2045321762561798, 0.1391817331314087, 0.5286830067634583, 0.5764685273170471, 0.1882934868335724, -0.30167853832244873, -0.2122340053319931, -0.45651525259017944, -0.016777794808149338, 0.45624101161956787, -0.0438646525144577, -0.992512047290802, -0.3771328926086426, 0.04916151612997055, -0.5830298066139221, -0.01255014631897211, 0.21600870788097382, -0.18419665098190308, 0.1754663586616516, -0.1499166339635849, -0.1916201263666153, -0.22884036600589752, 0.17280352115631104, 0.25274306535720825, 0.3511175513267517, -0.20270302891731262, -0.6383468508720398, 0.43260180950164795, -0.21136239171028137, -0.05920517444610596, 0.7145522832870483, 0.7626600861549377, -0.5473887920379639, 0.4523043632507324, -0.1723199188709259, -0.10209759324789047, -0.5577948093414307, -0.10156919807195663, 0.31126976013183594, 0.3604489266872406, -0.13295558094978333, 0.2473849356174469, 0.278846800327301, -0.28618067502975464, 0.00527254119515419] '); Please note – while we created a word vector with the lowercase representation of the Software Industry (software), the industries.name column needs to match the capitalized industry name (Software). Once all of the generated word vectors have been added to the industries table, we can change our focus to introducing a RESTful API. Introduce a Spring Boot Service This was the point where my passion as a software engineer jumped into high gear, because I had everything in place to solve the challenge at hand. 
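One optional note on the Postgres side before moving on: with only a couple dozen industry rows, a sequential scan is perfectly adequate, but pgvector can also index a vector column for approximate nearest-neighbor search on larger tables. A sketch using its ivfflat index with the Euclidean (L2) operator class might look like this:

SQL
-- Optional: approximate nearest-neighbor index for larger vector tables.
CREATE INDEX industries_embeddings_idx
    ON salesforce.industries
    USING ivfflat (embeddings vector_l2_ops)
    WITH (lists = 100);

The lists value is a tuning parameter; the pgvector documentation describes how to size it relative to the number of rows.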
Next, using Spring Boot 3.2.2 and Java (temurin) 17, I created the similarity-search-sfdc project in IntelliJ IDEA with the following Maven dependencies: XML <dependencies> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-actuator</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-data-jpa</artifactId> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-web</artifactId> </dependency> <dependency> <groupId>com.pgvector</groupId> <artifactId>pgvector</artifactId> <version>0.1.4</version> </dependency> <dependency> <groupId>org.postgresql</groupId> <artifactId>postgresql</artifactId> <scope>runtime</scope> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-configuration-processor</artifactId> <optional>true</optional> </dependency> <dependency> <groupId>org.projectlombok</groupId> <artifactId>lombok</artifactId> <optional>true</optional> </dependency> <dependency> <groupId>org.springframework.boot</groupId> <artifactId>spring-boot-starter-test</artifactId> <scope>test</scope> </dependency> </dependencies> I created simplified entities for both the Account object and Industry (embedding) object, which lined up with the Postgres database tables created earlier. Java @AllArgsConstructor @NoArgsConstructor @Data @Entity @Table(name = "account", schema = "salesforce") public class Account { @Id @Column(name = "sfid") private String id; private String name; private String industry; } @AllArgsConstructor @NoArgsConstructor @Data @Entity @Table(name = "industries", schema = "salesforce") public class Industry { @Id private String name; } Using the JpaRepository interface, I added the following extensions to allow easy access to the Postgres tables: Java public interface AccountsRepository extends JpaRepository<Account, String> { @Query(nativeQuery = true, value = "SELECT sfid, name, industry " + "FROM salesforce.account " + "WHERE industry IN (SELECT name " + " FROM salesforce.industries " + " WHERE name != :industry " + " ORDER BY embeddings <-> (SELECT embeddings FROM salesforce.industries WHERE name = :industry) " + " LIMIT :limit)" + "ORDER BY name") Set<Account> findSimilaritiesForIndustry(String industry, int limit); } public interface IndustriesRepository extends JpaRepository<Industry, String> { } Notice the findSimilaritiesForIndustry() method is where all the heavy lifting will take place for solving this use case. The method will accept the following parameters: industry: the industry to find similarities for limit: the maximum number of industry similarities to search against when querying for accounts Note the Euclidean distance operator (<->) in our query above. This is the extension’s built-in operator for performing the similarity search. 
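For context, <-> is not the only option: pgvector also provides a cosine distance operator (<=>) and a negative inner product operator (<#>). Cosine distance is a common choice for word embeddings; switching to it in this example would only change the ORDER BY clause of the industry subquery, for instance:

SQL
-- Same industry lookup, ranked by cosine distance (<=>) instead of Euclidean distance (<->).
SELECT name
FROM salesforce.industries
WHERE name != 'Software'
ORDER BY embeddings <=> (SELECT embeddings FROM salesforce.industries WHERE name = 'Software')
LIMIT 3;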
With the original “Software” industry use case and a limit set to the three closest industries, the query being executed would look like this: SQL SELECT sfid, name, industry FROM salesforce.account WHERE industry IN (SELECT name FROM salesforce.industries WHERE name != 'Software' ORDER BY embeddings <-> (SELECT embeddings FROM salesforce.industries WHERE name = 'Software') LIMIT 3) ORDER BY name; From there, I built the AccountsService class to interact with the JPA repositories: Java @RequiredArgsConstructor @Service public class AccountsService { private final AccountsRepository accountsRepository; private final IndustriesRepository industriesRepository; public Set<Account> getAccountsBySimilarIndustry(String industry, int limit) throws Exception { List<Industry> industries = industriesRepository.findAll(); if (industries .stream() .map(Industry::getName) .anyMatch(industry::equals)) { return accountsRepository .findSimilaritiesForIndustry(industry, limit); } else { throw new Exception( "Could not locate '" + industry + "' industry"); } } } Lastly, I had the AccountsController class provide a RESTful entry point and connect to the AccountsService: Java @RequiredArgsConstructor @RestController @RequestMapping(value = "/accounts") public class AccountsController { private final AccountsService accountsService; @GetMapping(value = "/similarities") public ResponseEntity<Set<Account>> getAccountsBySimilarIndustry(@RequestParam String industry, @RequestParam int limit) { try { return new ResponseEntity<>( accountsService .getAccountsBySimilarIndustry(industry, limit), HttpStatus.OK); } catch (Exception e) { return new ResponseEntity<>(HttpStatus.NOT_FOUND); } } } Deploy to Heroku With the Spring Boot service ready, I added the following Procfile to the project, letting Heroku know more about our service: Shell web: java $JAVA_OPTS -Dserver.port=$PORT -jar target/*.jar To be safe, I added the system.properties file to specify what versions of Java and Maven are expected: Properties files java.runtime.version=17 maven.version=3.9.5 Using the Heroku CLI, I added a remote to my GitLab repository for the similarity-search-sfdc service to the Heroku platform: Shell heroku git:remote -a similarity-search-sfdc I also set the buildpack type for the similarity-search-sfdc service via the following command: Shell heroku buildpacks:set https://github.com/heroku/heroku-buildpack-java Finally, I deployed the similarity-search-sfdc service to Heroku using the following command: Shell git push heroku Now, the Resources tab for the similarity-search-sfdc app appeared as shown below: Similarity Search in Action With the RESTful API running, I issued the following cURL command to locate the top three Salesforce industries (and associated accounts) that are closest to the Software industry: Shell curl --location 'https://HEROKU-APP-ROOT-URL/accounts/similarities?industry=Software&limit=3' The RESTful API returns a 200 OK HTTP response status along with the following payload: JSON [ { "id": "001Kd00001bsP80IAE", "name": "CleanSlate Technology Group", "industry": "Technology" }, { "id": "001Kd00001bsPBFIA2", "name": "CMG Worldwide", "industry": "Media" }, { "id": "001Kd00001bsP8AIAU", "name": "Dev Spotlight", "industry": "Technology" }, { "id": "001Kd00001bsP8hIAE", "name": "Egghead", "industry": "Electronics" }, { "id": "001Kd00001bsP85IAE", "name": "Marqeta", "industry": "Technology" } ] As a result, the Technology, Media, and Electronics industries are the closest industries to the Software industry in this 
example. Now, the marketing department has a list of accounts they can contact for their next campaign. Conclusion Years ago, I spent more time than I would like to admit playing the Team Fortress 2 multiplayer video game. Here’s a screenshot from an event back in 2012 that was a lot of fun: Those familiar with this aspect of my life could tell you that my default choice of player class was the soldier. This is because the soldier has the best balance of health, movement, speed, and firepower. I feel like software engineers are the “soldier class” of the real world, because we can adapt to any situation and focus on providing solutions that meet expectations in an efficient manner. For a few years now, I have been focused on the following mission statement, which I feel can apply to any IT professional: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” - J. Vester In the example for this post, we were able to leverage Heroku Connect to synchronize enterprise data with a Postgres database. After installing the pgvector extension, we created word vectors for each unique industry from those Salesforce accounts. Finally, we introduced a Spring Boot service, which simplified the process of locating Salesforce accounts whose industry was closest to another industry. We solved this use case quickly with existing open-source technologies, the addition of a tiny Spring Boot service, and the Heroku PaaS — fully adhering to my mission statement. I cannot imagine how much time would be required without these frameworks, products, and services. If you’re interested, you can find the original source code for this article on GitLab. Have a really great day!