As any database or data warehouse administrator will tell you, most of the required consistency checks are complex. Most copy management tools do not have the capability of applying these checks directly. Specifically, a number of products that claim to have this facility implement it by allowing users to code the logic in either SQL, stored procedures. or their own programming language. What this means is that a substantial amount of development time is spent just implementing this logic. This problem will leads one to recommend that you investigate the cost-benefit of using a copy management tool, prior to making the purchasing decision. If the source systems do not overlap much, and the consistency checks are simplistic, a copy management tool will cut down the coding effort required. If this is not the case, a copy management tool may not add sufficient value to justify purchase.
To prevent errors, make sure that when you take a row of data and examine it, the contents of the row must make sense. Errors at this point are mainly to do with errors in the source systems. Typical checks are for nonsensical phone numbers, addresses, counts, and so on. When you examine the data against other tables within the same source, the data must make sense. For example, one can check for the existence of the stock-keeping unit (SKU)/customer/service specified in the transaction, by comparing it with the list of valid SK Us/customers/services.
August 20th, 2008 | Posted in General | No Comments
Once the data is extracted from the source systems, it is then typically loaded into a temporary data store in order for it to be cleaned up and made consistent. These checks can be quite complex, and identify consistency issues when integrating data from a number of data sources. In addition, as data changes over time, errors become apparent that have gone unnoticed because the day-to-day discrepancies were too small to detect. If 50 customers no longer appear in the customer details database week on week, we would expect to find 50 cijstomer events representing the cancellation of those subscriptions.
If this information does not appear in the customer events area of the data warehouse, the user can quite rightly expect this to be a significant inconsistency. In practice, the likelihood of this error occurring can be quite high, because the two source systems have significantly overlapping data sets. The data warehouse probably is the first time that consistency issues between the two separate systems become apparent. Put another way, if two source systems have overlapping data sets, the effort required to clean them both up will be much higher than twice the effort it takes to clean one up. In addition, the process must be capable of fully automatic running: that is, it has the intelligence to report errors in the load and move on, and/or request human intervention. Care should be taken when designing the load process to ensure that the error recovery is an integral part of the design.
August 6th, 2008 | Posted in General | No Comments
Data should be in a consistent state when it is extracted from the source system. More to the point, the information in a data warehouse represents a snapshot of corporate information, so that the user is looking at a single, consistent, version of the truth. Source data should be extracted only at a point where it represents the same instance of time as the extracts from the other data sources. For example, in a customer profiling data warehouse in the telecommunications sector (telco), it is illogical to merge the list of customers at 7 pm on Friday from a customer database with the customer subscription events up to 7 pm on Thursday from a customer events database.
This would mean that we are likely to find customers for whom there are no associated subscriptions. This problem can be more complex, if the point at which operational data represents the same time period differs between systems. Continuing with the previous example, it is very possible that the customer database may not be updated to reflect new subscriptions until 3 hours after the close of business on the customer events system. This is why we typically have to wait until all the data sources are in line before we begin executing consistency checks. In order to ensure that the various tools, logic modules, and programs are executed in the correct sequence and at the correct time, a controlling mechanism is required to fire each module when appropriate.
July 31st, 2008 | Posted in General | No Comments
Data extraction takes data from source systems and makes it available to the data warehouse; data load takes extracted data and loads it into the data warehouse. Data in operational systems is held in a form suitable for that system. When we extract data from a physical database, whatever form it is held in, the original information content will have been modified and extended over the years, in order to support the data/performance requirements of the operational system. Before loading the data into the data warehouse this information content must be reconstructed.
In essence, information can be defined as data with context and meaning. The data warehouse extract and load process must take data and add context and meaning in order to convert it into value-adding business information. Within a data warehouse, this is achieved by extracting the data from the source systems, loading it into the database, stripping out any detail that is there to support the operational system rather than the business requirement, adding more context (that is, more reference data), and then reconciling the data with the other sources.
July 20th, 2008 | Posted in General | No Comments
The processes required to populate the warehouse focus on extracting the data, cleaning it up, and making it available for analysis. This is typically done on a daily basis after the close of the business day. A common misconception is that data warehouses are read-only systems. In fact, one of the key challenges for a data warehouse is the daily load and management of new data. It is, however, true that the factual data, once loaded, is usually not updated, but reference information will change on an ongoing basis as new requirements are identified to analyze the factual data in different ways.
The day-to-day management of the data warehouse is different from the management of an operational system, because the volumes can be much larger, and require more active management, such as creating and deleting summaries, or rolling data on and off the archive. In essence, a data warehouse is a database that is continually changing to satisfy new business requirements. Requirements evolution tends to be the most complex aspect of a data warehouse. This requires the architecture to be structured in such a way as to cope with future changes in query profiles. This evolution will also encompass the addition of completely new subject areas. In practice, a critical issue to address up front is how big the data warehouse will eventually be. The answer to this question indicates the magnitude of the total solution, and how much headroom is required from the underlying hardware and software.
July 5th, 2008 | Posted in General | No Comments
The most important aspect of the delivery process is that the requirements are never static. Business requirements will constantly change during the life of the data warehouse, so it is imperative that the process supports this, and allows these changes to be reflected within the system. This issue can appear to be a very complex one, but in practice, it is addressed by designing the data warehouse around the use of data within business processes, as opposed to the data requirements of existing queries. The architecture is designed to change and grow to match changing business needs. However, it is critical that the changing requirements are captured and managed through an ongoing requirements capture activity.
In effect, the process operates as a pseudo joint application development process, where new user requirements are continually fed into the development activities. Partial deliverables are produced, fed back to users, and then reworked, ensuring that the overall system is continually updated to meet the business needs. More often than not, this involves the loading of additional data sources into the data warehouse, although it can just be the introduction of new data marts using the existing information. In either case, the effort and complexity can be substantial, and the phase should be performed as a separate phase.
June 29th, 2008 | Posted in General | No Comments
The next step in the process is where we configure an ad hoc query tool to operate against the data warehouse. These end-user access tools are capable of automatically generating the database query that answers any question posed by the user. The users will typically pose questions in terms that they are familiar with (for example, sales by store last week); this is converted into the database query by the access tool, which is aware of the structure of information within the data warehouse.
The process of generating a database query may sound simple enough, but in .practice can be very tricky to get right. The challenge is not to get the user access tool to generate the query; rather, it is to cajole the tool into generating a performant query. Our experience has been that this process is time consuming, and can result in substantial changes to the database in order to force the tool to generate the optimum query. To this end, we can mitigate project risk by performing this activity within a separate phase, after the first benefits are delivered. We do not mean to imply that you should never perform this activity within one of the other phases. It is our opinion that project risk and complexity will substantially increase, if the user access tool is configured and tuned at the same time that the database is being substantially modified.
June 20th, 2008 | Posted in General | No Comments
For a successful Data Warehouse Delivery Sequence, a backload of atleast two years worth of sales history from archive tape is necessary. This would allow the business user to analyze seasonal sales trends year on year. However, the data volumes could be such that the existing relatively small data warehouse becomes a much larger one. For example, a 50 GB database could easily expand to become 400 GR. Once this happens, the operational management issues become far more complex, and require special strategies and facilities to resolve. For example, backup and recovery procedures become much more complex, disk failures increase dramatically, and load processes take much longer to execute.
In order to resolve these complex issues, we recommend that the activity to backload history is implemented within a separate phase (the history load phase). This allows you to focus on resolving the very large database (VLDB) issues without increasing the complexity even further by extending the scope of the data warehouse.
June 12th, 2008 | Posted in General | No Comments
Building the vision is the stage where the first production deliverable is produced. This is typically the smallest component of the data warehouse. that adds business benefit. For example, this stage will probably build the major infrastructure components for extracting and loading data, but limit them to the extraction and load of one or two data sources, with minimal history. To a very large extent, the purpose of minimizing the scope of this phase is to reduce project risk, and to timebox the deliverable into a 4-6-month exercise. We recommend that the deliverable satisfies your most pressing business requirement for data analysis. If the time-scales for this activity are significantly.
The remaining phases within a data warehouse delivery can occur in any sequence; that is, they are driven by the specific business drivers. As we have discussed, the purpose of breaking the delivery up into phases is to reduce project and delivery risk. This is achieved by reducing the scope of each phase so that it lasts 3 to 4 months elapsed. If the projected phase is much bigger than this, it indicates that the scope is probably too large. In most cases, the next phase is one where the remainder of the required history is loaded into the data warehouse. This means that new entities would not be added to the data warehouse, but additional physical tables would probably be created to store the increased data volumes. For example, let us consider a case where the building the vision phase has delivered a retail sales analysis data warehouse with 3 months’ worth of history. This information is probably ample to allow the business user to analyze recent trends and address short-term sales issues. It does not provide sufficient data to identify annual or seasonal sales trends.
May 27th, 2008 | Posted in General | No Comments
In order to produce a set of production-quality deliverables that grow to a full solution, the data warehouse manager must make sure that the overall requirements are understood, and that the overall system architecture is in place. The business requirements and technical blueprint phases are designed to address those two points. By understanding the business requirements for both the short and medium term, the data warehouse manager can design a solution that satisfies the short-term need, but is capable of growing to the full solution. Additionally, at least 20% of the time within the business requirements phase should be spent on understanding the likely longer-term requirement. Any less effort than that will affect the feasibility of designing a data warehouse that can grow to satisfy a future requirement.
The technical blueprint phase must deliver an overall architecture that satisfies the longer-term requirements, and a definition of the components that must be implemented in the short term in order to derive any business benefit. A detailed design of the database is not produced at this stage; rather, the most significant components are identified and sized. In practice, these will tend to be the largest tables that contain information about the business: that is, the major fact tables and associated dimensions.
May 20th, 2008 | Posted in General | No Comments