Tuesday, April 21, 2009

File Handling in Perl Scripts

Handling input and output from/to files and the Command Prompt

As previously noted, programs consist primarily of data and instructions for manipulating the data. Earlier we looked at programs where the data was included in the program code. However, programs can also manipulate data from external sources, such as text files and the terminal.

Input/Output from/to the Terminal

There are two principal ways of getting data into and out of Perl scripts: by using the terminal and by using files. The "terminal" is basically the command prompt supplied by the operating system of your computer. Any output to the terminal is known to Perl (following the Unix convention) as STDOUT, and any input from the terminal is known as STDIN. The most common use of STDOUT is with the print function; in fact, if you do not tell "print" where to print to, it outputs to STDOUT by default. Therefore, the following two statements are equivalent:

print "Hello";
print STDOUT "Hello";

In the scripts that we will be writing in this course, we won't be using STDIN, which is mainly useful for short amounts of unstructured data (our data tends to be highly structured, using tabs, fields, etc.). However, since it is so easy to print to STDOUT, many people writing short scripts print to STDOUT and redirect the output to a file. This allows them to forgo opening and writing files from within Perl (which we will look at in a moment). To redirect STDOUT to a file, use the following syntax:

perl script.pl > captured_output.txt

The > operator is not part of Perl; rather, it is provided by the command shell on most operating systems, including Unix, Linux, Mac OS X and Windows.
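
As a small illustration of a script written with redirection in mind, the sketch below prints two tab-separated lines to STDOUT and does nothing else. The sample records, and the use of join to assemble them, are assumptions made for this example only.

```perl
use strict;
use warnings;

# Hypothetical sample records; each one is a tab-separated line.
my @records = ("title\tyear", "Dune\t1965");

# Put a newline after every record, then print the lot to STDOUT.
my $output = join("\n", @records) . "\n";
print $output;
```

Saved as script.pl and run as perl script.pl > captured_output.txt, this should leave the two lines in captured_output.txt rather than on the screen.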

Redirecting STDOUT to a file is not always desirable. For example, if you want your script to output more than one file, redirecting is not straightforward. Also, redirecting only works well when the output is plain text. MARC communications files are binary, so we should not use redirection to create them.

Input/Output from/to Files

Perl uses the following functions to open and close files (appropriately called "open" and "close"):

open (INPUTFILE, "$input");
close (INPUTFILE);

"INPUTFILE" is called a filehandle, and is the name that Perl uses to refer to the open file. The actual name has no significance (we could have called this one CGGG6HHH); by convention it is written in upper-case characters, but this is not required. The second parameter, "$input" in this case, is a variable that contains the location of the file that is to be associated with the filehandle.

It is common to include error-checking code in file open and close statements, since files can have any number of problems, such as not being there (file not found) or permissions problems. The open statement above with this type of error checking is:

open (INPUTFILE, "$input") or die ("Problem with opening $input: $!");

$! is a special Perl variable that contains the most recent error message from the operating system, which in this case explains why the identified file could not be opened. The die function prints the message it is given and stops the script.
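
To see what $! looks like in practice, here is a minimal sketch that tries to open a file that is assumed not to exist. The file name is hypothetical, and the script stores the message and prints it instead of calling die, so that it can carry on running.

```perl
use strict;
use warnings;

# Hypothetical file name, chosen so that the open is expected to fail.
my $input = "no_such_file_for_this_demo.txt";

my $message;
if (open(INPUTFILE, $input)) {
    $message = "Opened $input";
    close(INPUTFILE);
}
else {
    # $! holds the operating system's reason for the failure.
    $message = "Problem with opening $input: $!";
}

print $message . "\n";
```

The exact wording of the reason depends on the operating system, so it is not shown here.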

You define how you want an open file to interact with your script by assigning a "mode". The three most common modes are read, overwrite, and append, signified by <, >, and >>, respectively. The mode indicators are prepended to the location of the file. For example,

open (FILE, "<$file"); # Means open $file in read mode
open (FILE, ">$file"); # Means open $file in overwrite mode
open (FILE, ">>$file"); # Means open $file in append mode


Now, to put the pieces together. If you want to read data from a file, simply open it in read mode. The filehandle then gives your script access to the contents of the file, which you can read and manipulate in various ways. We'll describe how to read a file one line at a time later.

If you want to add data to a file (called "printing" the data to the file), you need to open the file in either overwrite or append mode (depending on what you want to do) and then use the print function along with the filehandle you want to print to, like this:

open (OUTPUTFILE, ">$output") or die ("Problem with opening $output: $!");
print OUTPUTFILE "Whatever you want to add to the file";
close (OUTPUTFILE);

Notice that in this example we printed to OUTPUTFILE just like we printed to STDOUT above. STDOUT is actually a special filehandle that Perl opens for you and that is connected to the terminal by default.
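
The difference between overwrite and append mode is easiest to see by running both against the same file. The following is only a sketch: the scratch file name is hypothetical, and reading the whole file into an array with @lines = <INPUTFILE> is a shortcut used here to check the result, not something covered above.

```perl
use strict;
use warnings;

# Hypothetical scratch file used only for this demonstration.
my $output = "append_demo.txt";

# Overwrite mode: creates the file, or empties it if it already exists.
open(OUTPUTFILE, ">$output") or die("Problem with opening $output: $!");
print OUTPUTFILE "first line\n";
close(OUTPUTFILE);

# Append mode: keeps what is already there and adds to the end.
open(OUTPUTFILE, ">>$output") or die("Problem with opening $output: $!");
print OUTPUTFILE "second line\n";
close(OUTPUTFILE);

# Read mode: pull every line of the file into an array.
open(INPUTFILE, "<$output") or die("Problem with opening $output: $!");
my @lines = <INPUTFILE>;
close(INPUTFILE);

print @lines;        # both lines, in the order they were written
unlink($output);     # remove the scratch file again
```

If the second open used > instead of >>, only "second line" would be left in the file.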

It is not usually necessary to explicitly close open files, because Perl closes them when the script ends, but it is a good habit to get into.

Exercise (Spreadsheet)

To illustrate how the above works in practice, we will write a spreadsheet program.

First create the spreadsheet in jEdit and save it as spreadsheet.txt (the file name that the program below opens):

12 4 167 17 8
9 34 4 12 1
62 14 67 0 88
78 9 34 67 5


That's five numbers on each line, separated by tabs. They don't have to be these numbers exactly.

Now let's write a new program, spreadsheet.pl, that will add up all the numbers on each line and give us separate totals.

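# This is the world's simplest spreadsheet program

open (SPREADSHEET, "spreadsheet.txt")
    or die ("Problem opening file: $!");

# Read the file one line at a time; each line arrives in $_
while (<SPREADSHEET>) {

    $numbers = $_;
    chomp($numbers);    # remove the newline from the end of the line

    # Split the line on tabs into an array of numbers
    @numbers = split(/\t/, $numbers);

    $total = 0;

    foreach $number (@numbers) {
        $total = $total + $number;
    }

    print $total . "\n";

}

close (SPREADSHEET);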


How it Works

As you can see, this program re-uses a lot of the same logic that we covered with hello.pl, and yet what it does is very different.

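while means 'while whatever is between the parentheses returns a value, do whatever is between the curly braces'. The standard way that Perl reads files is one line at a time: each time around the loop the next line is read from the SPREADSHEET filehandle, and when there are no lines left the loop stops. So the instructions between the braces get repeated for every line in the file.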

As lines from the file are read into the program, Perl has no name of your choosing to give them, so it stores each line in its default variable, $_. You could continue to refer to the line that way, but it makes the rest of the program more readable to give the incoming lines a more descriptive variable name. So each line is assigned here to the $numbers variable, using the assignment operator, =.

Other things we have not yet seen are the addition operator, +, and the split instruction. The split instruction splits data into chunks on whatever you specify as the separator, and stores the resulting values in an array. In this case, we've told split to split the line using the tab character as the separator, represented by \t.
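
To see split on its own, here is a minimal sketch that uses the first line of the sample spreadsheet, written out as a string with \t between the numbers so that no file is needed.

```perl
use strict;
use warnings;

# One line of the spreadsheet, with a tab (\t) between the numbers.
my $numbers = "12\t4\t167\t17\t8";

# Split on the tab character and store the chunks in an array.
my @numbers = split(/\t/, $numbers);

print scalar(@numbers) . "\n";   # how many chunks split produced: 5
print $numbers[2] . "\n";        # the third chunk (arrays count from 0): 167
```

Note that $numbers (the whole line) and @numbers (the array of chunks) are two separate variables, just as in spreadsheet.pl.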

After we've used the split instruction to populate the @numbers array, we then use foreach to run through each number in the array, adding it to the $total variable. This is an example of a nested loop: the foreach loop is nested within the while loop, which means the entire foreach loop is run for every iteration of the while loop. There is no limit as to how deeply you can nest loops; theoretically you can have loops within loops within loops ad infinitum. However, each extra level multiplies the number of times the innermost instructions run, so deep nesting over large amounts of data can really slow down your script.
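
The nesting can also be tried without a file. In the sketch below the four sample lines are held in an array (an assumption made so the example runs on its own), the outer loop is a foreach rather than a while, and the @totals array is added only to keep the results for inspection.

```perl
use strict;
use warnings;

# The four sample lines, held in memory so that no file is needed.
my @lines = (
    "12\t4\t167\t17\t8",
    "9\t34\t4\t12\t1",
    "62\t14\t67\t0\t88",
    "78\t9\t34\t67\t5",
);

my @totals;

foreach my $line (@lines) {                      # outer loop: once per line
    my $total = 0;

    foreach my $number (split(/\t/, $line)) {    # inner loop: once per number
        $total = $total + $number;
    }

    push(@totals, $total);
    print $total . "\n";
}
```

With these numbers the script prints 208, 60, 231 and 193, one per line. The inner loop body runs 20 times in all: five numbers for each of the four lines.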
