Reading and Writing

Perl Language

Vision Zero

NYC DOT's Vision Zero data

injuries: Zip GZip
fatalities: Zip GZip
Perl script

Minimum Wage

Effects of Minimum Wage (PDF)

data files: Zip Tar.GZ
Perl script

labor market research

Two fundamental tasks when assembling a dataset are reading and writing information. Here, we will read and write text files. In other contexts, we might use Perl's database interface.

But it's the comparison that we wish to make that is important. That comparison will determine how we combine and edit the data. The tasks discussed on this page – importing and exporting data – are critical, but ancillary tasks.

In our discussion of data structures, we will describe a few ways to combine datasets. And in our discussion of patterns in text, we will describe a few ways to edit the data.

But first, we must open our input files and read the data.

Here, we will explore the simple case where a comma delimits the fields, there are no commas within fields and the fields are limited to a single line. More complex cases can be handled with the Text::CSV_XS module.

As a general matter, we will open the input file, parse the data within a while loop and then close the input file. After combining and editing the data, we will print the assembled dataset to an output file.

Here, the first line of our CSV files contains the variable names, so we first open the input file, chomp the header row and split it into an array of column names:

## name of input file
my $infile = "FileName.csv" ;

## read in the header of CSV file
open( INFILE , $infile ) || die "could not open $infile" ;
chomp( my $header = <INFILE> ) ;
close INFILE ;

## split the header for use as column names
my @colnames = split( /,/ , $header ) ;

We will use those column names to identify the fields by name as we parse the input data and store it in a hash. So we create that hash:

## hash to hold input data
my %indata ;

Then, we open our CSV file once again and use a while loop to read the data, skipping the first line. After split-ting the comma-delimited fields, we store them in a "column hash" to identify them by column name.

The column hash might hold the field containing the state name in $ch{"State"} and the field containing the year in $ch{"Year"}, which we can use as indexes in %indata, the hash holding our input data:

## read in the data, skipping the header row
open( INFILE , $infile ) || die "could not open $infile" ;
<INFILE> while $. < 1 ;
while (<INFILE>) {
chomp;
my $line = $_ ;
my @cols = split( /,/ , $line ) ;

## "column hash" to identify fields by name
my %ch ;
foreach my $i (0..$#colnames) {
$ch{ $colnames[$i] } = $cols[$i] ;
}

my @varnames = ("VarOne", "VarTwo","VarTre") ;
foreach my $varname (@varnames) {
$indata{$ch{"State"}}{ $ch{"Year"}}{$varname} = $ch{$varname} ;
}
}
close INFILE ;

[an error occurred while processing this directive]

Then, after combining and editing the data, we use foreach loops to recall the data from the hash, push each field onto an array, join the elements of the array together and print it to an output file:

## name of output file
my $otfile = "OutputFile.csv" ;

## create header
my @otharray = ("state","year","VarOne", "VarTwo","VarTre") ;
my $otheader = join( "," , @otharray ) ;

## write out header
open( OTFILE , ">$otfile" ) || die "could not overwrite $otfile" ;
print OTFILE $otheader . "\n" ;

## write out data
foreach my $state ( sort keys %indata ) {
foreach my $year ( sort keys %{$indata{$state}} ) {

## create output array
my @otarray = ( $state , $year ) ;

my @varnames = ("VarOne", "VarTwo","VarTre") ;
foreach my $varname (@varnames) {
push( @otarray , $indata{$state}{$year}{$varname} ) ;
}

## prepare output string
my $otline = join( "," , @otarray ) ;

## print to output file
print OTFILE $otline . "\n" ;
}
}
close OTFILE ;

More detailed examples can be found in the Perl script for the Vision Zero analysis and the Perl script for the minimum wage analysis. Those scripts also provide examples of the data structures and patterns in text that we will discuss on the next pages.

:: previous ::
overview

table of contents

:: next ::
data structures