hw08-ec-web
extra-credit: latinizing web pages

Challenge Extra credit: Write a method which can translate a web page into pig latin.
It takes two Strings as input — the URL to read from, and the name of a file to write the translated html to.
(You'll be able to open that file from a browser, and view your result!)
(This is an extra-credit followup to hw08—Translating many words.)

It turns out, having a scanner read from a web page (rather than System.in) is easy. However, web pages are more than just words of text; web pages contain markup to indicate the structure of the document (where emphasized text should start and end, where to insert horizontal lines, etc.). Thus, web pages are written in “HTML” — “hypertext markup language”. We need to translate the regular information into pig latin, but leave this markup information untouched.

More information you need is here.

This requires some knowledge about

how to make a Scanner which reads from a web page,
how to write to a file instead of to the console.

We'll mention those below.

If a word starts with an ampersand (“&”), then don't turn it into pig latin. (HTML uses such words, called “entities”, to indicate special characters; for instance, “&mdash;” stands for an em dash.)
If a word starts with an open-angle-bracket (“<”), don't translate that word into Pig Latin, nor any following word up to and including a word which ends in a close-angle-bracket (“>”). In html, such words indicate markup information.
Prompt the user for a URL. (You can presume the URL is valid; it's okay for your program to crash if it's not.)
Prompt the user for the name of the output file.
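
The two skipping rules above (entities and markup) can be sketched as a small loop over the words a Scanner hands back. This is only one possible arrangement, not the required solution: the class name MarkupAwareTranslator, the method translateAll, and especially pigLatin (a placeholder that just appends "ay") are hypothetical stand-ins for whatever you wrote in hw08.

```java
import java.util.Scanner;

class MarkupAwareTranslator {
    // Hypothetical placeholder for your hw08 word translator.
    // It only appends "ay"; it is NOT real pig latin.
    static String pigLatin(String word) {
        return word + "ay";
    }

    // Reads every word from `in`; translates ordinary words,
    // but copies entities and markup through unchanged.
    static String translateAll(Scanner in) {
        StringBuilder out = new StringBuilder();
        boolean insideTag = false;   // true while between "<..." and "...>"
        while (in.hasNext()) {
            String word = in.next();
            if (!insideTag && word.startsWith("<")) {
                insideTag = true;    // markup begins at this word
            }
            if (insideTag) {
                out.append(word);    // markup: leave untouched
                if (word.endsWith(">")) {
                    insideTag = false;   // markup ends with this word
                }
            } else if (word.startsWith("&")) {
                out.append(word);    // entity: leave untouched
            } else {
                out.append(pigLatin(word));
            }
            out.append(" ");
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        Scanner demo = new Scanner("<a href=\"x.html\"> hello &amp; bye </a>");
        // prints: <a href="x.html"> helloay &amp; byeay </a>
        System.out.println(translateAll(demo));
    }
}
```

Note that this sketch joins words with single spaces, so the original line breaks are lost; browsers generally ignore that difference when displaying html.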

As promised, here's some library-specific information:

To create a Scanner which reads from (say) the RU home page rather than from the keyboard System.in,
java.util.Scanner s = new java.util.Scanner( new java.net.URL("http://www.radford.edu/").openStream() );
Note that as before, the Scanner method hasNext will always return true as long as there is more input to read; it only returns false once the entire web page has been read.
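
To see the hasNext loop in isolation, here is a small sketch that counts the words in any input stream. The names TokenCounter and countTokens are made up for illustration, and the demo reads from an in-memory stream so that it runs without a network connection; passing the result of openStream() instead should behave the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

class TokenCounter {
    // Counts whitespace-separated words until the input is exhausted.
    static int countTokens(InputStream source) {
        Scanner s = new Scanner(source);
        int count = 0;
        while (s.hasNext()) {   // false only once everything has been read
            s.next();
            count++;
        }
        s.close();
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for a web page; with a network connection you could pass
        // new java.net.URL("http://www.radford.edu/").openStream() instead.
        InputStream fake = new ByteArrayInputStream("<p> one two </p>".getBytes());
        System.out.println(countTokens(fake));   // prints 4
    }
}
```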

To write to a file instead of the console window System.out,

java.io.PrintStream myOut = new java.io.PrintStream( new java.io.File( "H:/oinkayOinkay.html" ) );
// Now, you can say:
myOut.println("hello");

// Before our program quits, we must close the file¹:
myOut.close();

Important: in order for either of the above two to compile, we need to add some information about Exceptions (errors) to the end of the enclosing method's header, just before its opening brace; exceptions will be discussed further in ITEC220.
throws java.net.MalformedURLException, java.io.FileNotFoundException, java.io.IOException
(More on this coming soon.)
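
As a rough illustration of where that throws clause sits, the sketch below copies a page word-by-word into a file, with no translation at all. The names PageCopier and copyPage are hypothetical, and the demo deliberately uses a temporary file with a "file:" URL (an assumption made so it runs offline); an "http:" URL would go in the same spot.

```java
import java.io.File;
import java.io.PrintStream;
import java.net.URL;
import java.util.Scanner;

class PageCopier {
    // The throws clause goes after the parameter list, before the brace.
    static void copyPage(String url, String outFileName)
            throws java.net.MalformedURLException, java.io.FileNotFoundException, java.io.IOException {
        Scanner s = new Scanner(new URL(url).openStream());
        PrintStream myOut = new PrintStream(new File(outFileName));
        while (s.hasNext()) {
            myOut.println(s.next());   // one word per line, untranslated
        }
        myOut.close();   // close the output before the program quits
        s.close();
    }

    public static void main(String[] args) throws java.io.IOException {
        // Build a tiny stand-in "web page" on disk so the demo needs no network.
        File in = File.createTempFile("page", ".html");
        File out = File.createTempFile("copy", ".html");
        PrintStream p = new PrintStream(in);
        p.println("<p> hello world </p>");
        p.close();
        copyPage(in.toURI().toURL().toString(), out.getPath());
        System.out.println("copied to " + out.getPath());
    }
}
```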
To actually see your translated page, you'll need to use a web browser, select File > Open File…, and open the disk-file your program just printed its output to.

A final note: our html-processing is oblivious to the actual structure of the markup. A proper approach would be more sophisticated, reading the structure of the markup and then processing the resulting tree.

¹Well, technically, it's the java.io.PrintStream which we must close.


©2008, Ian Barland, Radford University
Last modified 2008.Nov.04 (Tue)
