Big Data/Analytics Zone is brought to you in partnership with:

Jesse has posted 3 posts at DZone. View Full User Profile

Moving Mountains of Data without Sacrificing Performance

04.28.2010
| 13839 views |
  • submit to reddit

With the increased merging of disparate core business systems in the enterprise - as well as the emergence of additional systems in the form of enterprise resource management, customer relationship management, hierarchical storage strategies, and other business-driven initiatives - many companies today find themselves moving mountains of data on a daily basis. Business intelligence (BI) initiatives in particular typically rely on data warehousing strategies to provide critical information and reports to management in support of business decisions. Such strategies often require the timely transfer of enormous amounts of data from line-of-business systems. Too much time taken in data transfer can adversely impact a company's agility and could mean lost windows of business opportunities. It can also encroach on processing resources better devoted to core business applications.

Of course, sufficient memory and processing resources, high-performance network infrastructure, and optimized applications and databases all play a part in keeping data moving as quickly as possible. But one often overlooked area that can play a powerful role in accelerating data transfer lies in the data connectivity and integration middleware that provides access to data stores for applications and other databases.

 

Up Against the Window

To illustrate, consider a large Wall Street financial company that runs an online trading application. They move data from that application to a relational database in order to generate dashboard reports to management twice a day. The trading application - and the twice daily report - are an essential part of the company's core business as management makes critical business decisions based on these reports.

So how do they load the trading data into their BI data warehouse? They have their log files in the standard Extended Log File (ELF) format coming out of an IIS or maybe an Apache server, and they must get the data from the file into their data warehouse- Oracle, for example. What they use is plain vanilla JDBC standard batch code they have developed: it shreds the data through some File I/O, prepares statements, sets each of the appropriate parameters, and executes the batch. This runs each night and once during each day in order to capture the data about every 12 hours.

This approach works well for the IT staff for a time: they like using the standard code and have members accustomed to working with it. But the trading application takes off in popularity and the rapidly increasing data load begins to press the staff up against their deadlines for delivering the dashboards, due to the increased time it takes to get the data into the Oracle database - thereby jeopardizing timely and critical reports to management.

 

Under the Covers

The data connectivity layer - the middleware providing connectivity across application and database platforms via standard API-based methods such as ODBC, JDBC, and ADO.NET - offers the ideal platform to implement technology advancements that can greatly assist in moving very large amounts of data into data warehouses and other systems. Some vendors specializing in data integration and connectivity software (including ODBC, JDBC, and ADO.NET drivers) have been making innovative strides in accelerating data movement with bulk load operations.

Let's take a closer look under the covers of that Wall Street data warehousing operation to examine the mechanism of the bottleneck they've run up against. What their approach does is perform a series of execute batch operations - one group of data at a time separated by Java execute Batch() statement calls. As those batch loads are going over to it, the database has a lot of work to do for each one of them in inserting the data - operations such as validating not null values against column definitions, checking for integrity constraints, etc. - all of which takes time.

With bulk load functionality as implemented by more sophisticated JDBC drivers, the protocol used allows the process to bypass some of these operations and send data of much greater amounts at one time. These drivers, which communicate directly in the native "wire" protocols of the target databases, can as much as double the performance of data load operations, reducing the load time by half. So a data load of 13 million database rows that traditionally would take 6 hours to complete would, using bulk load, require only 3. This is accomplished with a simple connection-string change with no change to the driver code or the batch code in the application.

In the use case of the Wall Street finance firm, that kind of performance gain not only completely solves the deadline crisis threatening their BI reporting; it frees up a good deal of time for analysis of the processed data, giving the company a competitive edge in their operational responsiveness.

 

Streamlining Multiple Database Architecture

The use case example given here is an unusually simple one in terms of the heterogeneous IT environments prevalent in businesses today. It presumes standardization on a single and unchanging database platform. The real-world enterprise is seldom so neat; typically several different databases are involved in data transfer. Even where a company uses a single database for data warehousing, any number of factors can change that. A company standardized on DB2 today may have been standardized on SQL Server 10 years ago. It may be purchased by another company that requires standardization on another database platform altogether. Additionally, even if the data warehouse is a single database the chances of that being the only database used in the company are slim; most likely a different database is used for marketing, sales, and executive management.

The reality is that where your data is stored can and probably will change in the future. A challenge here is that, with traditional bulk data load, there has been no cross-platform standard comparable to standards such as ODBC or JDBC batches. Here again, some data connectivity middleware vendors are stepping up to fill this need.

For example, in a scenario where you have four different file formats that you must be able to import into your applications, you have four components taking care of your database-specific SQL loader or Bulk Copy Program (BCP) code, and you have the underlying tools to configure for each in order to get the data into the database. Besides being cumbersome and costly to maintain, this approach involves four separate points of potential failure. You want to get rid of that code, eliminate the different formats, and instead simplify your architecture both for improved performance and reduced costs of operation.

Vendors of state-of-the-art wire protocol connectivity solutions can provide a common file format and a single API coding approach for multiple database platforms, streamlining the data load process.

Provided the vendor offers comprehensive support for a wide array of connectivity standards, the standard file format they provide can also work across ODBC, JDBC, and ADO.NET drivers. The wire protocol technology eliminates cross-vendor dependence on client tooling for loading of bulk data.

The same vendor can likely provide alternate solutions using this technology, while supporting existing file formats (which may be a requirement of the organization). In this case, file formats and data types are transformed into a common standard CSV format for bulk load through the drivers.

Another rapid-transfer method of bulk load is available to data centers using a cutting-edge wire protocol data connectivity solution, which might be termed "bulk streaming." Say you have some data in a DB2 database and want to get it into an Oracle database as quickly as possible, with no need to store the data an a standard CSV format. What you could do is issue a Select statement and get a standard ResultSet from JDBC - it can be any ResultSet; you can select certain columns and tailor the data you want to insert into Oracle. You'd then pass the ResultSet to the load method on the DB2 load object.

In this scenario, what is basically happening here is that you are converting the ResultSet into a bulk wire protocol. So it's something like a wire protocol conversion: you're taking the individual pieces of data from the ResultSet and constructing that into a data stream that can be transferred via the wire protocol to get it directly into Oracle.

 

Going Where Bulk Load has Never Gone Before

Truly comprehensive data integration and connectivity vendors support data movement across all platforms, including mainframe data stores. Provided you're using technology from such a vendor that also employs wire protocol solutions, you can apply such bulk load techniques to pull even non-relational data from mainframe into a relational data warehouse. Say you have a key application in your organization that stores its data in a VSAM file on the mainframe, but all your BI dashboards and reporting analysis run against Oracle. How do you unlock that crucial data and make it available to your analysis staff to operate on that data using their accustomed tools?

You could deploy an event-driven scenario using technology available from some of the more sophisticated middleware vendors to trigger a bulk load into Oracle from the VSAM file whenever a given set of conditions is met. It performs a single Select statement from the VSAM file, passes the ResultSet to the load method on a bulk load object for the Oracle database, and the data is inserted. If you imagine it would take an awful lot of coding to accomplish this, you'd be mistaken - a mere five lines of code can accomplish the task of pulling data out of VSAM and bulk load it into an Oracle database.

As you've gleaned from this article, exciting developments are taking place in data connectivity and integration middleware technology. Partnering with a reputable vendor specializing in this technology can easily accelerate your organization's data load processes with considerable benefits to both your business operations and your bottom line. You can apply cutting-edge wire protocol, data conversion, and mainframe integration solutions to any number of different scenarios, and can utilize bulk wire protocol where you've never had any data access at all, in systems that may have been running critical business applications for years. This often overlooked aspect of systems integration richly deserves a good look.

 

About the Author

As senior engineering manager for the Connect and Shadow z/Direct product lines from Progress DataDirect, Jesse Davis is responsible for DataDirect's future product development initiatives in these areas. Jesse has lead development efforts for a number of DataDirect Connect products, and has developed core modules for the Connect for ODBC and Connect for JDBC product groups. His Twitter ID is @jldavis007.

References
Published at DZone with permission of its author, Jesse Davis. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)