Software Design Using C++

Web Search II (Web Search in Linux)

Introduction

This is the second of a two-part case study on web search. It is specific to Linux, and assumes that you have an account on a Linux web server, that the web server is set up to run CGI programs, and that the web server has the uncgi program available for use with the CGI programs. (The uncgi program can easily be found on the Internet.) You also need to have a C++ compiler on your Linux system. Our case study assumes that the compiler is g++, though others are likely to work as well.

If you review what we did in the first part of this case study, Web Search I, you will notice that although we managed to write Linux software that searched for web pages matching a target string, we did not have a web interface to this software. Our goal now is to show one way to create such a web interface.

Security Warning: Although a web interface to a search engine is very desirable, since it is convenient and is used by all of the well-known Internet search engines, anytime you allow user input into a server program from anyone on the Internet, you should be very worried about the security of your program and your server. Attackers love to stuff long input strings and unexpected input (such as code for a malicious script) into your input box(es). They may thus be able to cause a buffer overflow or a similar type of overflow. These can cause your software to crash. In the worst case, they can give the attacker complete control of your server! Thus, there is indeed reason to be concerned. To learn more about the buffer overflow problem, read the Buffer Overflow section of the Professional Programming page.

Because of this serious security concern, do not try the software below on a Linux web server that is accessible to users on the Internet. (Saint Vincent CIS students can safely do so on one particular Linux server due to a version of the uncgi program that it uses to reject unsafe user input. See Br. David for details.) Besides using a specific version of uncgi to filter out malicious input, the software written below does some filtering of user input as well. Further details are given below.

Creating a Web Interface for our Search Program

Our new search program with the web interface will work on the same weblist text file that we used in part I of this series. Recall that weblist is set up to contain the complete pathname for an HTML file on each line. Here is the short example weblist that we used before:


/www/carlson/cs125/final.html
/www/carlson/cs125/math/homework2.html
/www/carlson/cs125/math/homework3.html
/www/carlson/cs125/math/notes.html
/www/carlson/cs125/math/homework4.html
/www/carlson/cs125/math/homework5.html
/www/carlson/cs125/math/homework6.html
/www/carlson/cs125/math/homework7.html
/www/carlson/cs125/math/review.html
/www/carlson/cs125/math/answer1.html
/www/carlson/cs125/math/answer3.html
/www/carlson/cs125/math/answer4.html
/www/carlson/cs125/math/answer5.html
/www/carlson/cs125/math/math.html
/www/carlson/cs125/math/mathnotes.html
/www/carlson/cs125/menu.html
/www/carlson/cs125/hw2a.html
/www/carlson/cs125/lasthomework.html
/www/carlson/cs125/hw8.html
/www/carlson/cs125/hw4answer.html
/www/carlson/cs125/hwpractice.html
/www/carlson/cs125/Hw4.html
/www/carlson/cs125/homework1.html
/www/carlson/cs125/review1.html
/www/carlson/cs125/homework2.html
/www/carlson/cs125/review2.html
/www/carlson/cs125/hw6.html
/www/carlson/cs125/syll125.html
/www/carlson/cs125/script1.html
/www/carlson/cs125/script3.html
/www/carlson/cs125/assign.html
/www/carlson/cs125/script2.html
/www/carlson/cs125/hw1.html
/www/carlson/cs125/hw2.html
/www/carlson/cs125/hw3.html
/www/carlson/cs125/hw5.html
/www/carlson/cs125/review3.html
/www/carlson/cs125/homework3.html

You can use this sample weblist file if you wish, though it would be more interesting to use a weblist file containing real data for HTML files on your Linux web server. One way to do this is to manually edit your weblist file, typing in line by line the pathnames of various HTML files on your server. Since creating a long list of HTML files by hand is time-consuming, you might want to use the makelist script given in part I of this series.

For the web interface we will use a web page with a form where the user can enter the target string for which to search. When the user clicks the submit button on the form, a CGI (Common Gateway Interface) program will execute and carry out the search for the target string in the weblist file. Although it is typical to use a CGI script, we will use a compiled C++ program as our CGI program. A compiled CGI program (if well written) is likely to be more secure than a script.

Let's begin by examining the web page, websearch.html, shown below. There are a few more elements (such as a DOCTYPE line) that ought to be present to give good HTML, but the simplified HTML shown is adequate and will work in most browsers.


<HTML>

<HEAD>
<TITLE>Simple Web Page Search</TITLE>

<script type="text/javascript">
<!-- Hide script from browsers that cannot handle it.

function validate(thisform)
   {
   if (thisform.Key.value == "")
      {
      alert("No keyword or phrase entered.  Please try again.")
      thisform.Key.focus()
      return false
      }
   else
      return true
   }
//End Javascript-->
</script>

</HEAD>

<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#000099" VLINK="#000099" ALINK="#FF0000">

<H2>Simple Web Page Search</H2>

<UL>
   <LI>Fill in the box with a keyword or phrase for which you want to search.</LI>
   <LI>The search is not case-sensitive.</LI>
</UL>

<FORM METHOD=POST ACTION="/cgi-bin/uncgi/websearch" onSubmit="return validate(this)">
<P>
<INPUT TYPE="TEXT" NAME="Key" SIZE="36" MAXLENGTH="48">
<STRONG>Keyword or phrase</STRONG>
</P>

<P>
<INPUT TYPE="submit" VALUE="Search">
<INPUT TYPE="reset" VALUE="Clear the form">
</P>
</FORM>

</BODY>
</HTML>

We will not give an extensive presentation on HTML here. Rather, let's take a quick look at the features in websearch.html to be sure that they make sense. The HTML markup tags are in angle brackets. The ones with the forward slash generally indicate a closing tag for an earlier opening tag. Thus the entire document is enclosed between an opening <HTML> and a closing </HTML>. The HEAD section contains a title (displayed at the very top of your browser window) and one JavaScript function named validate. This function is not necessary. It is here as a convenience for any case in which the user fails to fill out the form on this web page before clicking on the submit button. In such a case, the validate function displays an error message and puts the cursor into the box on the form where user input is missing. Where this function gets called will be seen below.

The rest of the web page is the BODY. The opening BODY tag is also used to set up the colors to be used for ordinary text, for links, etc. The opening and closing H2 tags are used to indicate a header. The UL tags are used to give an unordered list. Each item in the list is enclosed in LI (list item) tags. Most browsers show an unordered list as a bulleted list.

Now we reach the most important section. We see the opening FORM tag. This begins the form that the user will fill out. The onSubmit="return validate(this)" calls our validate function (explained above) when the user clicks on the submit button. This could be omitted and is here simply to help out users who forget to fill in the input box on the form. POST indicates that when the user clicks on the submit button (assuming that validate does not find an empty input box), the user's data will be sent to the web server in an action referred to as a POST. In addition, the ACTION="/cgi-bin/uncgi/websearch" tells the web server what to do when it receives this data. In this case, it says to send the user's data to the uncgi program found in the cgi-bin folder (found under the webroot) and to then run the websearch CGI program (our compiled C++ search program) after this.

You probably wonder at this point how the user data gets to the websearch program since we instead sent it to uncgi. The purpose of uncgi is to break apart the user input (when there is more than one input box on the form) and to place it into appropriately named variables that your CGI programs can access. The type of variable used is called an environment variable and the name of each starts with the characters WWW_. Thus our websearch program will find the user input in a certain variable.
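
As a rough illustration of how a CGI program can pick up such a variable, here is a minimal sketch that uses the standard getenv function. The name ReadEnvSketch is hypothetical and is not part of stringhelp.cpp. Note that it does no filtering at all, so it should not be used as is on untrusted input.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper (not from stringhelp.cpp): returns the value of the
// named environment variable, or the empty string if it is not set.
// No filtering of the value is done here.
std::string ReadEnvSketch(const char * Name)
   {
   const char * Value = std::getenv(Name);
   if (Value == nullptr)
      return std::string();
   return std::string(Value);
   }
```

A call such as ReadEnvSketch("WWW_Key") would then give whatever uncgi stored for the form field named Key.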

The P tags are used to enclose a paragraph, though in our websearch.html file we don't have paragraphs of text. Instead the first paragraph contains a text field named Key of length 36. Next to this field is shown the label "Keyword or phrase". The STRONG tags around it are typically interpreted by browsers as the same as B (bold) tags. This field is where the user places the target string for which to search. The field is wide enough to display 36 characters, but the user is allowed to type in up to 48 (MAXLENGTH) characters. Since this field is named Key, the uncgi program will place the user data from this field into an environment variable named WWW_Key.

Many forms would have other input fields, but ours has only the one. The only other item on this form is another paragraph containing a submit button and a reset button. The VALUE items give the text to display on these buttons. The reset button simply clears out the data that the user has placed on the form in case the user wants to start over. The important button is the submit button, since it starts all of the action: the validating that there is data in the field on the form, the sending of the data to uncgi, and the running of the CGI program websearch.

We write our websearch.cpp program with the aid of the stringhelp.cpp and stringhelp.h files we used in Web Search I. Here are these two old files and the complete websearch.cpp file:

This time we use another function from stringhelp.cpp for use with StringType strings. The SubstringPresent and MyGetLine functions are familiar from before. Now we use the GetValue function as well. As you see from its comments, this function looks up the value of an environment variable (such as the WWW_Key variable that uncgi sets up from the user input from our form). Read the GetValue comment section to see how this function accepts only reasonable characters and returns the empty string if a potentially malicious character is encountered. GetValue also truncates the user data if need be so that it fits properly (including the NUL character marking the end of the string) into the StringType variable used to hold the result.
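
To make the idea of allow-list filtering concrete, here is a minimal sketch. It is not the GetValue function itself: it uses std::string rather than StringType, and the set of allowed characters (letters, digits, space, hyphen, underscore, and period) and the limit of 48 characters are assumptions chosen for illustration. The names FilterSketch and MaxKeyLength are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical limit, standing in for the capacity of a StringType.
const std::size_t MaxKeyLength = 48;

// Returns the empty string if Raw contains any character outside the
// assumed allow-list. Otherwise returns Raw, truncated to at most
// MaxKeyLength characters.
std::string FilterSketch(const std::string & Raw)
   {
   for (unsigned char Ch : Raw)
      {
      bool Allowed = std::isalnum(Ch) || Ch == ' ' || Ch == '-'
         || Ch == '_' || Ch == '.';
      if (! Allowed)
         return std::string();
      }
   return Raw.substr(0, MaxKeyLength);
   }
```

With this allow-list, an input such as <script> or a#b is rejected outright, while an overly long run of letters is simply cut down to the limit.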

Now we are ready to look at websearch.cpp itself. Read the comments at the top of websearch.cpp to be sure that you understand what the program does overall. The comments also tell you how to compile it in Linux and where the compiled websearch and associated files should be placed on your Linux web server.

The main function simply opens our weblist file of pathnames, calls the SearchFile function, and then closes the weblist file. The first significant thing that SearchFile does is to use GetValue("WWW_Key", Key) to obtain the target string from the WWW_Key Linux environment variable. The length of the target string is found with the usual strlen function.

The SearchFile function prints its results to standard output. This output includes HTML tags and information on the matches that the search found for the target string. In general, the output of a CGI program is sent by the web server to the user's browser. Thus what our SearchFile function appears to be printing on the screen is really what is being sent to the user's browser as the results of the search.

We must follow proper protocol so that the browser receives the type of information that it can handle. For an HTML file, browsers expect to receive a line that says "Content-type: text/html" (without the quotes), followed by a blank line, and then followed by the opening HTML tag and the rest of the marked-up web page. That is why our SearchFile function prints "Content-type: text/html" followed by 2 endlines. The extra endline gives the expected blank line.

The PrintFile(HEAD) uses a helping function to quickly print the contents of webhead.html (shown below) to standard output. As you can see in the box below, this file contains the standard opening HTML for the web page of search results. The PrintFile function itself is an easy one: It opens the indicated file, and reads line after line of it, printing each line read to standard output.


<HTML>

<HEAD>
<TITLE>Simple Web Page Search Results</TITLE>
</HEAD>

<BODY TEXT="#000000" BGCOLOR="#FFFFFF" LINK="#000099" VLINK="#000099" ALINK="#FF0000">

<H1>Simple Web Page Search Results</H1>

SearchFile then goes on to output HTML tags, messages, and the URLs for the matches in our search. It begins with an OL tag, which is used to give an ordered (numbered) list. Each item in this list is marked up with LI tags (to indicate a list item). Each such list item gives the URL for a match or, for the case where the target string has length zero, a message saying that the target had length zero or contained disallowed characters (which resulted in GetValue sending us an empty target string).

Assuming that KeyLength, the length of the target string, is not zero, our SearchFile function goes on to do a sequential search of the weblist file in a similar fashion to what we saw in the search.cpp used in part I of this series. What is different is that when we print out the URL for a match, we print it marked up as a list item for our results web page. What we want sent to standard output is something like the example shown here:


<LI><A HREF="http://cis.stvincent.edu/ex.html">http://cis.stvincent.edu/ex.html</A></LI>

You already know that the LI tags indicate a list item. The A (anchor) tag is used here with HREF to give a clickable link. The string after the = sign is the URL to go to when the link is clicked, while the string between the opening and closing A tags is the text displayed as the link. In our case both strings are the same: they are the URL for our match.

We use the PrintURL function that we used in our previous project to print the URL for a pathname to an HTML file matching the desired target. There is, however, one small change in the PrintURL function: It no longer prints an endline after the URL, as we have no need for that here. The other difficult item to print is the double quote, since it is used to begin and end literal strings. The solution used here is to put the ASCII code for the double quote into a character variable, called Quote, and to print that.

Our SearchFile function finishes by printing the closing HTML tags for the ordered list, the body of the web page, and HTML itself.

Compile websearch.cpp as indicated by the comments contained in it. The compiled executable, named websearch, should be copied to the cgi-bin folder for your web server. The weblist and webhead.html should also be copied to this location. The websearch executable should be given 755 permissions and the weblist and webhead.html files should be given 644 permissions as shown here:


chmod 755 websearch
chmod 644 weblist webhead.html

The websearch.html file should be placed in whatever location within your web server that you desire. For example, you might want it in the webroot or in your own subfolder within the web root. Its permissions should be 644.

To test the completed web interface and search program, point your browser to the proper URL so that it displays websearch.html. For example, if you placed websearch.html in the webroot, the URL might be something like http://cis.stvincent.edu/websearch.html, though things will vary on your web server. Fill in on the form a reasonable target string (one that is found in your weblist file) and click the submit button. Once you have this working, try target strings that do not appear in your weblist file, the empty string, strings that contain unexpected characters such as # or a comma, etc.

Homework

Once again you should try a similar project that takes things a bit further. Here is your challenge:

This homework utilizes the keywordfile that was used in the previous homework for part I of this series. Change websearch.cpp so that it uses the keywordfile as found on our Linux system. (Saint Vincent students should contact Br. David for the location and other information on this file.) Other readers might manually construct a keywordfile that fits the format shown in the section about this under the above link or might try the getmeta script found under the above link. (To have working links on your search results web page, you need to use data for actual HTML files on your web server.)

As before, look for the target string in the keywords section only. Do not allow a partial match. That is, the target string must match an entire keyword (or phrase), not just part of it. To do this, create a new target string that consists of a # symbol, the old target, and another #. Then use SubstringPresent to look for this new target. Use GetSubstring to extract the pieces of a matching line of the keywordfile. Thus, the path should be extracted (and converted to a URL), and the description should be extracted so that it can be printed separately. As in the example given in this web page, the information on any matches should be displayed in an ordered list on a results web page. The same techniques used in our example above should work fine.

After completing this project, if you want another challenge, try rewriting the same project using the usual string type instead of the StringType that I created. Those who are interested in creating a similar search engine for the Windows environment could pursue that idea. It is possible to use VBScript (run under what is called Windows Script Host) to create the text file of pathnames for HTML files, now perhaps called keywordfile.txt. Or, you could simply create a small keywordfile.txt manually with a text editor. The search engine could be created as a Windows Forms application, though the material we have posted is for stand-alone Windows apps, not web apps. (We do have some material posted on creating VB .NET web apps, so you might try that approach if you want to get a web interface to your search program.)

High-Performance Search Engines

Has the challenge of creating useful search engines captured your interest? If so, you might want to study how large Internet search engines work. The methods used in our case study are OK for searching a single, small web site but do not scale well to searching the Internet. A good overview of search engine techniques can be found in the brief book Understanding Search Engines: Mathematical Modeling and Text Retrieval, 2nd ed., by Michael W. Berry and Murray Browne, SIAM (2005). Who knows? Perhaps you will be one of those who will create the next generation of Internet search engines!