
TextCleanup Perl Script for Scrubbing HTML Files


To automate some of my own digital text analysis processes, I created a little Perl program that I am sharing for other digital humanists to use to scrub HTML tags from HTML files and to create and output a new raw text .TXT file. 

Grab the program and a sample HTML page from my GitHub repository TextCleanup, and feel free to fork and modify.

Here are full instructions for using the program. If you don't know any Perl, that should not be a problem. When working in the Terminal, be sure to press the Enter or Return key after every line you type.

Linux distros and Mac OS X should have Perl already installed. To confirm this, open a Terminal window (if you're on a Mac, you can search for Terminal and the window should pop up) and type:

perl -v

If you don't see a read-out that confirms your version of Perl, or if you are on Windows, you will have to download Perl.

Since the program uses the CPAN module HTML::Restrict by Olaf Alders, you'll need a CPAN client to install it. The cpan client usually comes with Perl; you can check that it's available by typing

cpan

into your terminal window. (If this opens a cpan> prompt, type exit to leave it.) The next steps use cpanm (cpanminus), a simpler installer. If you've never used Perl before, you likely will not have it, in which case you should install it by typing

cpan App::cpanminus

Then you can install HTML::Restrict by typing

cpanm HTML::Restrict

into your terminal window. You can install any module this way: type the cpanm command followed by the name of the module. Read the full instructions to download CPAN modules here.
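To confirm that the install worked before running the main script, a small check like the following can help. This is only a sketch: the helper name module_available is made up for illustration and is not part of the TextCleanup repository.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if this Perl can load the named module, 0 otherwise
sub module_available {
    my ($module) = @_;
    (my $path = "$module.pm") =~ s{::}{/}g;   # HTML::Restrict -> HTML/Restrict.pm
    return eval { require $path; 1 } ? 1 : 0;
}

# Report on the one module the script depends on
for my $module ('HTML::Restrict') {
    my $status = module_available($module) ? "installed" : "NOT installed";
    print "$module: $status\n";
}
```

If the report says the module is not installed, rerunning the cpanm command and reading its output for errors is a reasonable next step.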

Once all of this is installed, you can download my Perl script (HTMLToCleanTxt.pl) or copy and paste it into a raw text file. To test it out, you can also download the accompanying HTML file of Alice's Adventures in Wonderland (originally from Project Gutenberg) from the TextCleanup repository.

If you are not comfortable with navigating files in Terminal yet, you should put the script and the HTML file(s) you want to clean in the same folder on your computer.

You will have to modify the script so that it points to the HTML file you want to clean up. Change the file name in line 6 of the code to the name of your file:

open $file, "[your file name].html" or die "Couldn't open file: $!";

That is, if your file name is shakespearesonnets.html, you should change the line to read

open $file, "shakespearesonnets.html" or die "Couldn't open file: $!";
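One detail worth knowing if you adapt an open line like this: when open is written without parentheses, the || operator binds more tightly than the comma, so the die is attached to the file name rather than to open, and a missing file goes unreported. Using or (or wrapping the arguments in parentheses) avoids this. The sketch below demonstrates the difference; the file name and the died helper are made up for illustration, and the file is assumed not to exist.

```perl
use strict;
use warnings;

# Hypothetical file name, assumed NOT to exist in the current directory
my $missing = "no-such-file-for-demo.html";

# Hypothetical helper: returns 1 if running the code reference died, 0 otherwise
sub died {
    my ($code) = @_;
    return eval { $code->(); 1 } ? 0 : 1;
}

# Parsed as open(my $fh, "<", ($missing || die ...)): the die never runs
my $with_pipes = died(sub { open my $fh, "<", $missing || die "Couldn't open file: $!"; });

# Parsed as (open my $fh, "<", $missing) or die ...: the die runs when open fails
my $with_or = died(sub { open my $fh, "<", $missing or die "Couldn't open file: $!"; });

print "with ||: ", ($with_pipes ? "died" : "failed silently"), "\n";
print "with or: ", ($with_or    ? "died" : "failed silently"), "\n";
```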

To keep track of your files, you should also change the output file name in line 17:

my $newfile = "[your file name].txt";

Again, if your original file is called shakespearesonnets.html, you can keep track of it by giving the new file the same name with a .txt extension, as follows:

my $newfile = "shakespearesonnets.txt";

Once you are all set with the script, navigate in Terminal to the directory that your program is in. Here is a cheatsheet. You can also just move these files to the directory you currently have open in your Terminal window, which will most likely be your user directory.

From the directory that the Perl program and HTML file are in, type

perl HTMLToCleanTxt.pl

If you have no errors, a new .txt file should be created with all the HTML tags removed from the .html file. If you have errors, ensure that you are in the correct directory, that everything was installed correctly and that there are no typos in the HTML file name. 
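If editing the script for every new file becomes tedious, the same workflow can be sketched with the file names passed on the command line. This is not the code in HTMLToCleanTxt.pl; it is a minimal illustration that assumes HTML::Restrict is installed, and the helper names txt_name_for and clean_file are made up here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: shakespearesonnets.html -> shakespearesonnets.txt
sub txt_name_for {
    my ($html_name) = @_;
    (my $base = $html_name) =~ s/\.html?$//i;   # drop a trailing .html or .htm
    return "$base.txt";
}

# Hypothetical helper: read one HTML file, strip its tags, write the text file
sub clean_file {
    my ($html_name) = @_;
    require HTML::Restrict;   # loaded here so txt_name_for works without the module

    open(my $in, '<', $html_name) or die "Couldn't open $html_name: $!";
    my $html = do { local $/; <$in> };   # read the whole file at once
    close $in;

    # With no rules given, HTML::Restrict removes all tags
    my $text = HTML::Restrict->new->process($html) // '';

    my $txt_name = txt_name_for($html_name);
    open(my $out, '>', $txt_name) or die "Couldn't write $txt_name: $!";
    print {$out} $text;
    close $out;
    return $txt_name;
}

# Clean every file named on the command line
print "Wrote ", clean_file($_), "\n" for @ARGV;
```

Saved under a name of your choice (say, clean_html.pl, a hypothetical name), it would be run as perl clean_html.pl shakespearesonnets.html and would write shakespearesonnets.txt into the same folder.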

This should help to prepare texts for topic modeling, sentiment analysis, visualizations, and other digital humanities text processes. 
