ardixon
6/19/2013 - 10:26 PM

ExcelConverter creates either a tab-delimited, unicode tab-delimited, csv, or xml file from a user selected excel spreadsheet.

ExcelConverter creates either a tab-delimited, unicode tab-delimited, csv, or xml file from a user selected excel spreadsheet.

In Digital Services, we worked a lot with excel files, but there were times in our workflow when we needed to convert an excel file to a different format. For instance, we used Archivist Utilities to generate mods from the metadata, but Archivist Utilities can only upload tab-delimited files and the metadata is an excel file.

Before I started working there, the workflow for doing this consisted of opening the file in excel and using excel to convert the file to tab-delimited. This creates problems though, because when excel converts a file to a different file format, it takes any character within a cell that can be used as a delimiter (things like commas and quotations) and treats those characters as delimiters, so everything in those cells gets put into quotations. It also puts quotations around cells that contain any sort of diacritic. What you get is a file full of un-wanted quotations mixed in with the good quotations. There is no easy workaround for this, so we would then have to open the converted file in Notepad ++, and use Notepad to delete all quotation marks. Finally, we would go back and add the quotations that were supposed to be in the file.

This process was time consuming and left a lot of room for human error. So I decided to create a script that makes this process efficient and removes the possibility of error. What I came up with was Excel Converter.

 Excel Converter allows a user to convert any excel file to either a tab-delimited, unicode tab-delimited, comma-separated-values, or xml file. In our workflow, we mainly used the unicode tab-delimited option. That option also parses and removes only the unwanted quotes added by the Excel export.  All of the *good* quotes within a cell are retained.
 
Using Perl, I wrote the part of the script that converts the excel file to other formats. I also wrote a graphic user interface for it using TKx. The user interface allows the user to choose which file to convert, where to save the new file, and what file type to save it as.

Now looking under the hood, the script utilizes XlFileFormat Enumeration, which is a numerical value Microsoft uses to specify file type. For example, excel sees the number -4158 as a tab-delimited file. Knowing these values makes it easy to convert excel files to other file types. We could also add more file types in the future if our workflow ever called for it. Jeremiah’s contribution was to add a routine to the unicode tab-delimited option that eliminates single quotes, and then looks for double quotes and turns them into single quotes, thus eliminating the un-wanted quotations. It also does a bit of conditional cleaning to the lines and tweaks the encoding so it worked perfectly for our specific purposes. 
##  Copyright (c) 2013, The University of Alabama Libraries.
##  Contributed by {Austin Dixon}  {5/13/2013}.
##  All rights reserved.
 
##  Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
 
##    * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
##    * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the
##       distribution.
##    * Neither the name of The University of Alabama Libraries nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
 
##THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
##THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
##CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
##PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
##LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
##EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.




# script creates either an tab-delimited, unicode tab-delimited, csv, or xml file from a user selected excel spreadsheet
# Unicode option also parses and removes only the unwanted quotes added by the Excel export. the *good* quotes within a cell are retained
# Unicode option now re-encodes the exported UCS-2 16bit little endian into standard UTF-8 without BOM
# Excel filename is now captured for use in nameing processed files for final export. this is implemented in the unicode export option as the default filename (06/10/2013)

#MUST create a folder named OUTPUT in the top level of the C drive before running script


#! usr/local/bin/perl
use Tkx;
use LWP::Simple;
use File::Copy;
use Win32::OLE qw(in with);
BEGIN { Win32::OLE->Initialize( Win32::OLE::COINIT_OLEINITIALIZE() ) }
use Win32::OLE::Const 'Microsoft Excel';
use Win32::OLE::Variant;
use Win32::OLE::NLS qw(:LOCALE :DATE);

# SETUP
													
# frame objects
my $fa;
my $fb;
my $fc;
# label objects
my $la;
my $lb;
my $lc;
my $ld;
my $le;
# button objects
my $ba;
my $bb;
my $bc;
my $bd;
my $bd;
my $be;
my $bf;

my $batch_filename = ''; # var handle for holding the name of the properly formed (hopefully) excel batch file for use when saving the cleaned and converted finished file

my $choice = 4; # default filetype (unicode)

#-------------------------------------------------
# convert button sub routine

sub chooseFile {
$url = Tkx::tk___getOpenFile();
$bc->configure( -state => 'active');
} 

#-------------------------------------------------
# file convert sub routine

sub runFileConverter {

if (substr($new_url, 2, 0) == '/') # determines if user selected address uses forward or backward slashes
{
	$s = '/';
}
else
{
	$s = '\\';
}

$Win32::OLE::Warn = 3; # Die on Errors.

my $Excel = Win32::OLE->GetActiveObject('Excel.Application')
            || Win32::OLE->new('Excel.Application', 'Quit');

$Excel->{DisplayAlerts}=0;

@url_parts = split(/$s/, $url);							# split excel file path on seperator and places pieces of the path into an array. $url path string remains unchanged
$batch_filename = @url_parts[-1];						# assign last portion of the excel file path (the filename and extention) to the batch_filename var

my $excel_file = $url;
my $workbook = $Excel->Workbooks->Open($excel_file);


if ($choice == 1) # tab-delimited
{
	my $saveas_file = 'C:\OUTPUT\tab_temp.txt';

	$workbook->SaveAs($saveas_file, -4158);				# see XlFileFormat Enumeration for more info
	sleep(4);
	#-------
	my $t = 'tab.txt';
	my $old_file = $saveas_file;
	my $new_file = $new_url . $s . $t;  				# builds new directory path
	
	copy($old_file,$new_file) or die "Copy failed: $!";	# renames (moves) files
}

elsif ($choice == 2) # csv
{
	my $saveas_file = 'C:\OUTPUT\csv_temp.txt';

	$workbook->SaveAs($saveas_file, 6);					# see XlFileFormat Enumeration for more info
	sleep(4);
	#-------
	my $c = 'csv.txt';
	my $old_file = $saveas_file;
	my $new_file = $new_url . $s . $c; 	  				# builds new directory path

	copy($old_file,$new_file) or die "Copy failed: $!";	# renames (moves) files
}

elsif ($choice == 3) # xml
{
	my $saveas_file = 'C:\OUTPUT\xml_temp.xml';

	$workbook->SaveAs($saveas_file, 46);				# see XlFileFormat Enumeration for more info
	sleep(4);
	#-------
	my $x = 'xml.xml';
	my $old_file = $saveas_file;
	my $new_file = $new_url . $s . $x;		  			# builds new directory path

	copy($old_file,$new_file) or die "Copy failed: $!";	# renames (moves) files

}

elsif ($choice == 4) 												# unicode (UCS-2 Little-Endian) (tab-delimited)
{
	my $saveas_file = 'C:\OUTPUT\uni_temp.txt';

	$workbook->SaveAs($saveas_file, 42);							# see XlFileFormat Enumeration for more info
	sleep(4);														# pause for file saving
	#-------
	
	open(my $rawbatchexport, "<", $saveas_file) or die; 			# opens the excel unicode text export file (in read mode) so we can clean up all that dirty quoted filth
	binmode($rawbatchexport, ":encoding(UTF-16LE)");				# change binary encoding mode to reflect the files UCS-2 (UTF-16LE) encoding
	
	$batch_filename =~ s/xlsx/txt/g;								# substitute excel extention for text extention
	my $new_file = $new_url . $s . $batch_filename;					# builds new directory path

	open(my $cleanbatchfile, ">>", $new_file) or die; 				# open a new file (in append mode) for printing the cleaned up lines into
	binmode($cleanbatchfile, ":encoding(UTF-8)");					# change binary encoding mode of the output file to UTF-8 (which by default is "without BOM")
	
	while ($line = <$rawbatchexport>) {								# while loop that read each line of a file
		chomp($line);												# removes the LF
		chop($line);												# removes the CR because it is the last char of the string
		$line =~ s/(?!"")"//g; 										# a substitution of nil for characters that satisfy this match will remove all stand alone double-quotes and reduce sequences of 3 to 2, and sequences of 2 to 1.
		$line =~ s/""/"/g; 											# a follow up substitution regex will replace all remaining sequences of 2 double-quotes with 1
		$line =~ s/$[\r]//;											# removes any end-of-line CR added in the substitution process from previous matches 
		$line = $line . "\n";										# add the complete record seperator
		print $cleanbatchfile $line;								# write the cleaned line of metadata to the new file
	}
	
	close $rawbatchexport or die;									# close dirty excel unicode export
	unlink $saveas_file;											# delete old export
	close $cleanbatchfile or die;									# close cleaned up file

}
}

#-------------------------------------------------
# GUI elements

# main window
my $mw = Tkx::widget->new(".");
$mw->g_wm_title("Excel Converter");
$mw->g_wm_minsize(150, 130);

# frame a
$fa = $mw->new_frame(
-relief => 'solid',
-borderwidth => 1,
-background => 'light gray',
);
$fa->g_pack( -side => 'left', -fill => 'both' );

#---------------------------------------------------
# choose file

$la = $fa->new_label(
-text => 'Choose File to Scan:',
-font => 'bold',
-bg => 'light gray',
-foreground => 'black',
);
$la->g_pack( -side => 'top', -fill => 'both' );

$lb = $fa->new_label(
-bg => 'blue',
-foreground => 'cyan',
-width => 28,
-textvariable => \$url,
);
$lb->g_pack( -side => 'top' );

$ba = $fa->new_button(
-text => 'Choose',
-command => \&chooseFile,
-height => 1,
-width => 15,
);
$ba->g_pack( -side => 'top', -pady => 5 );

#---------------------------------------------------
# choose directory

$ld = $fa->new_label(
-text => '   Choose Where To Save File:   ',
-font => 'bold',
-bg => 'light gray',
);
$ld->g_pack( -side => 'top', -fill => 'both' );

$le = $fa->new_label(
-bg => 'blue',
-foreground => 'cyan',
-width => 28,
-textvariable => \$new_url,
);
$le->g_pack( -side => 'top' );

$bf = $fa->new_button(
-text => 'Choose',
-command => sub {$new_url = Tkx::tk___chooseDirectory();},
-height => 1,
-width => 15,
);
$bf->g_pack( -side => 'top', -pady => 5 );

#------------------------------------------------
# convert button

$bc = $fa->new_button(
-borderwidth => 1,
-text => 'Convert File!',
-font => 'bold',
-command => \&runFileConverter,
-state => 'disabled',
-height => 2,
-width => 15,
);
$bc->g_pack( -side => 'bottom', -pady => 10 );

#------------------------------------------------
# frame b (choose filetype)

$fb = $mw->new_frame(
-relief => 'solid',
-borderwidth => 1,
-background => 'light gray',
);
$fb->g_pack( -side => 'right', -fill => 'both' );

$lb = $fb->new_label(
-text => 'Choose Filetype:',
-font => 'bold',
-bg => 'light gray',
-foreground => 'black',
);
$lb->g_pack( -side => 'top', -fill => 'both' );

$bb = $fb->new_radiobutton(
-bg => 'light gray',
-text => "Tab-Delimited", 
-variable => \$choice, 
-value => 1,	
);
$bb->g_pack( -anchor => w, -side => 'top', -pady => 4 );

$bd = $fb->new_radiobutton(
-bg => 'light gray',
-text => "CSV", 
-variable => \$choice, 
-value => 2,	
);
$bd->g_pack(  -anchor => w, -side => 'top', -pady => 4);

$be = $fb->new_radiobutton(
-bg => 'light gray',
-text => "XML", 
-variable => \$choice, 
-value => 3,	
);
$be->g_pack(  -anchor => w, -side => 'top', -pady => 4 );

$bf = $fb->new_radiobutton(
-bg => 'light gray',
-text => "Unicode", 
-variable => \$choice, 
-value => 4,	
);
$bf->g_pack(  -anchor => w, -side => 'top', -pady => 4 );

Tkx::MainLoop();