Friday, November 03, 2006

Reading Files With ColdFusion, or Java???

One project I was on, not too long ago, consisted of transferring a data file between two systems. Unfortunately, this data file wasn't XML. It was a pipe-delimited file of product records. The main purpose of this file was to sync up data between the two systems. Therefore, I needed to read the file line by line, parse out each field, and compare it with the current records in the database. Since the language of choice for this particular project was ColdFusion, I was encouraged to keep with the standard.

Being a Java guy, with some PHP and Perl background, I knew I could adapt. But how could I make this as efficient as possible? The app server we were using was ColdFusion version 6.1, so I knew I had options. At least it wasn't version 5 or below, like some other organizations who shall remain nameless.

Anyway, ColdFusion has a tag called <cffile>. This tag is nice and easy to use. However, it has one main drawback: it holds the entire file in memory at once. That might not seem like a problem, but I am talking about somewhere around 200,000 records, each containing 80 fields. That can make the application more prone to memory errors, especially if this is a user-requested job and not a scheduled job.

However, the standard Java development kit has the FileReader which, used with a BufferedReader, can stream the file line by line. So, I figured I'd give it a try. I started by comparing the two methods of file reading. I wrote a sample ColdFusion page that reads the file in and compares the time it takes to read and process the file using <cffile> and using the Java FileReader. I started with a small file containing 200 records, then moved up to a file with 25,000 records. The code is as follows:
Using cffile:
<cfset filePath = "./DataFile200Records.txt">
<cffile action="read" file="#filePath#" variable="fileContents">

Using CFFile<br>
<cfoutput>
Start:#now()#<br>
<cfloop index="line" list="#fileContents#" delimiters="#Chr(10)#">
<!--- Account for empty values --->
<cfset newline = " " & Replace(line, "||", " | | ", "all") & " ">
<!--- Convert the pipe-delimited list to an Array --->
<cfset fieldArray = listtoarray(newline,"|")>
<cfloop index="i" from="1" to="#arraylen(fieldArray)#">
<!--- Loop through each field --->
<!-- #fieldArray[i]# *** -->
</cfloop>
</cfloop>
End:#now()#<br> <!--- End Timestamp --->
</cfoutput>

Using Java FileReader:
Start:<cfoutput >#now()#</cfoutput><br><!--- Start Timestamp --->
<cfscript>
// create a FileReader and BufferedReader objects
fileReader = createObject("java","java.io.FileReader");
buffReader = createObject("java","java.io.BufferedReader");

// Instantiate the FileReader with the file path
fileReader.init(filePath);

// Instantiate the BufferedReader with the fileReader Object
buffReader.init(fileReader);

// read the first line into a String
fileLine = buffReader.readLine();

// Loop while the String is defined
while (isDefined("fileLine")) {
// Account for empty values
newline = " " & Replace(fileLine, "||", " | | ", "all") & " ";
// Convert the pipe-delimited list to an Array
fieldArray = listtoarray(newline,"|");
// Use lte so the last field is included, matching the cfloop version above
for (i = 1; i lte arraylen(fieldArray); i = i+1)
{
// Loop through each field
writeOutput("<!--" & fieldArray[i] & " *** -->");
}
// read the next line to continue the loop
fileLine = buffReader.readLine();
}

// close the Reader objects
buffReader.close();
fileReader.close();
</cfscript>
End:<cfoutput >#now()#</cfoutput><!--- End Timestamp --->

Essentially, each section of code does the same thing: it processes the file line by line, parses the fields (including empty ones), and loops through each field, writing it out as an HTML comment. The main difference is that the Java FileReader and BufferedReader objects have to be closed when they are no longer needed. You always want to close a streamed object; otherwise it can hold on to memory and the open file handle.
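
For comparison, here is a minimal Java sketch of the same field parsing. It is not the code from the project. The ColdFusion versions pad "||" with spaces because listToArray() skips empty list elements. In plain Java, String.split with a negative limit keeps empty fields, so no padding is needed. The class name PipeFields and the sample record are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the original project.
final class PipeFields {

    // Split one pipe-delimited record into its fields, keeping empty ones.
    static List<String> parse(String line) {
        // "\\|" escapes the pipe, which is a regex metacharacter.
        // The -1 limit tells split() to keep trailing empty fields too.
        return Arrays.asList(line.split("\\|", -1));
    }

    public static void main(String[] args) {
        // Made-up sample record with an empty second and last field.
        List<String> fields = parse("1001||Widget|");
        System.out.println(fields.size() + " fields: " + fields);
    }
}
```

Unlike the padded ColdFusion version, the fields here come back without extra surrounding spaces, so an empty field is simply an empty string.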

So, what are the results? You would think that CFFile would run faster, since the file is in memory and memory access is usually pretty quick. Well, for the small, 200-record file, that is correct. Below are the results:
200 Records
Using CFFile:
Start:{ts '2006-11-03 10:23:32'}
End:{ts '2006-11-03 10:23:33'}

Using Java FileReader
Start:{ts '2006-11-03 10:23:33'}
End:{ts '2006-11-03 10:23:35'}

Using the Java objects took a second longer. I think that may have to do with the overhead involved in opening and closing the Reader objects. But that overhead should be roughly the same for other files, regardless of their size.

Now, for the file with 25,000 records:
25,000 Records
Using CFFile:
Start:{ts '2006-11-03 10:47:46'}
End:{ts '2006-11-03 10:47:55'}

Using Java FileReader
Start:{ts '2006-11-03 10:47:55'}
End:{ts '2006-11-03 10:47:59'}

With 25,000 records, CFFile took 9 seconds and the Reader objects took 4 seconds. That is quite a difference. So, we ended up using the Java code, and it works well. We did some other mods, like using <cftry> and <cfcatch> blocks to handle errors gracefully. If this were a full Java approach, I would also use the finally {} block to close the Reader objects. The moral of the story is, we have options.
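
As a rough idea of what that full Java approach could look like, here is a small sketch under my own assumptions. It is not the code we deployed. It streams a pipe-delimited file one line at a time and closes the reader in a finally block, so the file handle is released even if reading fails partway through. The class name StreamingFileCounter and the countFields method are made up for the example, and counting fields stands in for the real record comparison.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical example class; the names are assumptions, not project code.
final class StreamingFileCounter {

    // Streams the file line by line and returns the total number of fields.
    static long countFields(String path) throws IOException {
        BufferedReader reader = null;
        long fieldCount = 0;
        try {
            reader = new BufferedReader(new FileReader(path));
            String line;
            // readLine() returns null at end of file, which ends the loop.
            while ((line = reader.readLine()) != null) {
                // A limit of -1 keeps empty fields, including trailing ones.
                fieldCount += line.split("\\|", -1).length;
            }
        } finally {
            // Runs whether or not an exception was thrown above.
            if (reader != null) {
                reader.close(); // also closes the wrapped FileReader
            }
        }
        return fieldCount;
    }

    public static void main(String[] args) throws IOException {
        // Expects the path to a pipe-delimited file as the first argument.
        System.out.println(countFields(args[0]) + " fields read");
    }
}
```

Only one line is held in memory at a time, which is the same property that made the BufferedReader attractive from ColdFusion.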
