Search


Software Design Using C++



Web Search I (Simple Web Search in Linux)



Introduction


This is the first of a two-part case study on web search. It is specific to Linux, and assumes that you have an account on a Linux web server, or that you at least have an account on a Linux computer where you can set up a bunch of files that imitate what might be present on a web server. You also need to have a C++ compiler on your Linux system. Our case study assumes that the compiler is g++, though others are likely to work as well.

Have you every wondered how the well-known Internet search engines work? They allow you to find documents scattered across the Internet that match your search conditions, and they generally do so very quickly. Creating a powerful search engine like this is beyond what we can do here, but as a start we can do some simple searching of documents on one web server. Perhaps you will be motivated to look further into search engines and someday work on the next generation of these!

A Simple Search


Let's begin with a pretty simple web search. It will only handle HTML files on your own web server. We will assume that we have (or can create) a list of the HTML files on this web server (or perhaps just those under some particular directory within the webroot). This list of HTML files is in a text file that has the complete pathname for one of these HTML files on each line. A really short example of such a file, which we will name weblist, is shown here:


/www/carlson/cs125/final.html
/www/carlson/cs125/math/homework2.html
/www/carlson/cs125/math/homework3.html
/www/carlson/cs125/math/notes.html
/www/carlson/cs125/math/homework4.html
/www/carlson/cs125/math/homework5.html
/www/carlson/cs125/math/homework6.html
/www/carlson/cs125/math/homework7.html
/www/carlson/cs125/math/review.html
/www/carlson/cs125/math/answer1.html
/www/carlson/cs125/math/answer3.html
/www/carlson/cs125/math/answer4.html
/www/carlson/cs125/math/answer5.html
/www/carlson/cs125/math/math.html
/www/carlson/cs125/math/mathnotes.html
/www/carlson/cs125/menu.html
/www/carlson/cs125/hw2a.html
/www/carlson/cs125/lasthomework.html
/www/carlson/cs125/hw8.html
/www/carlson/cs125/hw4answer.html
/www/carlson/cs125/hwpractice.html
/www/carlson/cs125/Hw4.html
/www/carlson/cs125/homework1.html
/www/carlson/cs125/review1.html
/www/carlson/cs125/homework2.html
/www/carlson/cs125/review2.html
/www/carlson/cs125/hw6.html
/www/carlson/cs125/syll125.html
/www/carlson/cs125/script1.html
/www/carlson/cs125/script3.html
/www/carlson/cs125/assign.html
/www/carlson/cs125/script2.html
/www/carlson/cs125/hw1.html
/www/carlson/cs125/hw2.html
/www/carlson/cs125/hw3.html
/www/carlson/cs125/hw5.html
/www/carlson/cs125/review3.html
/www/carlson/cs125/homework3.html

You can use this sample weblist file if you wish, though it would be more interesting to use a weblist file containing real data for HTML files on your Linux web server. One way to do this is to manually edit your weblist file, typing in line by line the pathnames of various HTML files on your server. This would be too tedious for handling a lot of data, but would be reasonable for a sample weblist file that is only a few dozen lines long. If you want to process a fair amount of data, you might be able to use the following bash script:


#! /bin/sh
#
#  Creates weblist, a text file containing the list of html files
#  found anywhere in the directory tree rooted at the directory
#  given as the first (and only) parameter to this script.
#
#  Warning: only a few error checks are made. Other things can still go wrong.

if [ $# != 1 ]
then
   echo "ERROR: One parameter needed, the name of the web folder to start at in finding html files"
   exit 1
elif [ ! -d "$1" ]
then
   echo "ERROR: $1 is not a directory"
   exit 2
fi

find "$1" -name '*.html' > weblist
exit 0

To use this script, copy it into a new file on your Linux system. Name the file makelist and set its permissions to 700, which you can do with the command:


chmod 700 makelist

To run makelist, cd to the directory containing the makelist script, then run it as in this example:


./makelist /www/carlson/cs125

The command-line parameter can be anything reasonable on your system. The /www/carlson/cs125 is probably useless on your server. The idea is that makelist produces the list of complete pathnames for all HTML files located anywhere under the folder given as the one parameter.

OK. We have our weblist file, as above. The actual search program will be a simple command-line program that prompts the user for the target string for which to look. The program then should do a sequential search through the weblist file and print all the lines where the target string was found within the line. It would be nicer to get a URL instead of a pathname, but we will worry about this in a later example.

Note that this will be a rather primitive search program in various ways. The user interface is clumsy. The search simply looks for a certain substring in the pathnames for the HTML files, which may or may not produce useful results. For example, in the sample weblist above, searching for cs125 would give a match on every line. However, searching for homework or for math would give fewer lines, and a search for syll would give just one line. Searching for view would yield just a few lines, and the results might not at all be what you want, since the filenames use the word review, which has a rather different meaning than the substring view. Note well that our search program does not look at the contents of the HTML files, though that could give a more helpful search program. The output to our simple search is also limited in that all we get is a list of filenames. There is no description of any of these files.

In spite of these shortcomings, let's continue. These limitations are deliberate as they allow us to create our first search program pretty quickly.

Before doing any actual coding, let's be sure that we have the program's specifications and design pretty well set. This is to be a command-line program (in Linux) that prompts the user for a target string. The program then does a sequential search of the text file weblist (with the format shown above) for all lines that contain the target string as a substring. All matching lines are printed as the output of the program. In addition, the weblist file is assumed to be in the same directory as the search program, so that it is easily available. Note that our program does just one search, though we could easily modify it to do repeated searches. The rough algorithm to do the sequential search through the weblist file is shown in the following pseudocode:


open weblist file for reading;
get the target string for which to search;
read a line of the file;
while we read more than the empty string
   {
   if target string is a substring of this line
      output this line;
   read the next line of the file;
   }
close weblist file;

Since we will need to do a number of string operations, both in this and in later search programs, a collection of string functions is supplied for you in the following two files. This collection of functions works with C-style strings of type StringType. Such a string is just an array of StrMax characters. StrMax must be chosen to be large enough to handle the types of strings to be used in our search program. (We could have chosen to use type string, but for the current program let's use StringType.) We will not explain these functions in detail, but do read through the comments on each function so that you know what it does and how to use it. If you are interested in how these functions are implemented, read through the code inside of the functions carefully.

The functions in this collection that appear to be of use for our current project are MyGetLine (which we can use to read the user's target string from cin and to read from the text file stream a line of the weblist file) and SubstringPresent (which we can use to see if the target string is a substring of a line from the weblist). We could use the usual C++ getline function, but let's use the MyGetLine function given here, partly because we can read its comments and code, and hence know exactly what it does (and does not) do. Note that stringhelp.h uses namespace std and includes header files that we are likely to need (see below). Thus we will not need to put these into the cpp file for our project; we will only need to do an include of stringhelp.h.


#include <iostream>
#include <fstream>
#include <cstring>
#include <cctype>
#include <cstdlib>
using namespace std;

Now for the coding. The main program will just open the text file, call a search function, and then close the text file. This gives us the following:


#include "stringhelp.h"

// Next line should give the name of the text file to be searched:
#define DATAFILE    "weblist"

// Function prototype:
void SearchFile(fstream & fs);

int main(void)
   {
   fstream InFile;

   InFile.open(DATAFILE, ios::in);
   if (InFile.fail())
      {
      cout << "Error: Cannot open file named " << DATAFILE << " for input" << endl;
      exit(1);
      }

   SearchFile(InFile);
   InFile.close();

   return 0;
   }

The search function pretty much follows the pseudocode given above, except that the opening and closing of the weblist file are handled in the main function. The code for the search function is shown below. Notice how MyGetLine returns the number of characters it read. This can nicely be used to control the loop for the sequential search. Similarly, MyGetLine's return value gives you the length of the target string, so that there is no need to use the usual strlen function. If this length is not positive, we print an error message and then wish to stop the program. Since we want to be sure to close the file before ending the program, we do a return (not an exit) to get out of the search function. The main function then closes the file and ends normally.


/* Given:   fs   Text file stream already opened for input.
   Task:    To prompt the user to enter a string and then to carry out a
            sequential search of the fs file for all instances of the target
            string.  Each matching line of the file is printed for the user.
   Return:  Nothing.
*/
void SearchFile(fstream & fs)
   {
   StringType Target, Line;
   int TargetLength, LineLength;

   cout << "Enter the target string that you wish to search for in the list of html files:" << endl;
   TargetLength = MyGetLine(cin, Target, StrMax);
   cout << endl <<"Matches:" << endl << endl;

   if (TargetLength == 0)
      {
      cout << "Error: Cannot use an empty target string" << endl;
      return;
      }

   LineLength = MyGetLine(fs, Line, StrMax);

   while (LineLength > 0)
      {
      if (SubstringPresent(Line, Target))
         cout << Line << endl;
      LineLength = MyGetLine(fs, Line, StrMax);
      }
   }

The complete project file is given under the link below. Comments (including how to compile and run the program) are included).

The suggested Linux command for compiling the program, g++ simplesearch.cpp stringhelp.cpp -o simplesearch -s, uses the g++ compiler to compile simplesearch.cpp and stringhelp.cpp into an executable ("output" file) named simplesearch. The -s option strips off the symbol table, which is not needed unless you want to run a debugger. Stripping the symbol table results in a smaller executable.

Before attempting to run the simplesearch program, be sure that the executable permission is turned on. You can manually turn it on as follows:


chmod 700 simplesearch

Next, be sure that the weblist file is in the current directory, along with the simplesearch executable. Then use ./simplesearch at the command line to run the program. Hopefully your program will prompt you for a target string and then print the matching lines, as expected.

A Better Search


Let's modify our search program so that it is a little more useful. In particular, let's have it handle repeated searches instead of just one. We will have the user enter the empty string as the target when the user wishes to stop the program. Let's make the output more helpful, too, by giving the URL corresponding to each matching pathname instead of the pathname itself. For example, if your weblist file contains these 2 lines:


/www/carlson/cs125/final.html
/www/carlson/cs125/hw/homework2.html

Then if you look up cs125, your output should include these lines:


http://cis.stvincent.edu/carlson/cs125/final.html
http://cis.stvincent.edu/carlson/cs125/hw/homework2.html

Notice how the pathnames have been converted into URLs.

These are small modifications to our previous program. You may want to try these on your own and then compare your solution to the following:

As you can see, the main function uses a while loop to allow the user to do repeated searches. The loop is controlled by the length of the target string entered by the user. (MyGetLine nicely returns this length, as we saw earlier.) The loop ends when the user's most recently entered target string has length zero.

There is a new function, PrintURL, which converts the pathname for a matching HTML file into a proper URL. It does this by printing the standard start for every URL on your web server and then skipping over the /www at the start of the pathname. Note that we use a FOR loop to print the pathname (without the initial /www) and that we print the characters one at a time by using the put function. Other ways of implementing the same functionality are, of course, possible. Read through the code to see the details.

Homework


How about a related project for you to try? Here is your challenge:

Change search.cpp so that it uses the keywordfile as on our Linux system. (Saint Vincent students should contact Br. David for the location and other information on this file. Other readers might manually construct a keywordfile that fits the format shown below. (It would be best to have one that matches some of the actual data on your Linux web server, but even fake data could be used). You might also be able to get your Linux system administrator to adapt and use the getmeta script to create a keywordfile of the proper type. This script requires a file called getmeta.subdirs whose contents must be adapted to indicate the subdirectories of webroot that you want to index on your web server. It also requires a compiled C utility. The source code is removespace.c and the directions for compiling it are contained within the source code file. Note that the getmeta script must be adjusted so that key variables are initialized to values that fit your web server situation. Even the scripting itself might need to be adjusted depending on the details of your version of Linux and other system-specific information. The compiled removespace program and the getmeta.subdirs file should be located in the same directory as the getmeta script.

In any case, this keywordfile contains on each line information about one html file. The format of each line is as in this example:


/html/ed.html|K-12 Education Links#Education#K-12#Education Links#Education Web Sites#

A starting /www is assumed. Thus the pathname for this html file is /www/html/ed.html The section between the | symbol and the first # is the description of this file. In our example, the description is: K-12 Education Links The strings between # symbols are the keywords (keyphrases) that can be used to look up this html file.

Look for the target string in the keywords section only. It is difficult to decide whether or not to allow a partial match. For example, in the above, should a target string of "Web" result in the display of this web page or should the complete keyphrase "Education Web Sites" be needed (or "Education Links", "K-12", or "Education"? If you do not allow partial matches, then a target of "Education Web Site" will not find this html page, as the final 's' is missing in the target. On the other hand allowing partial matches would find this page with a target of "ink".

So, let's decide this question for you: do NOT allow a partial keyword (or keyphrase) match. Insist on an exact match between the target string and a string between two # symbols. To do this, create a new target string that consists of a # symbol, the old target, and another #. Then use SubstringPresent to look for this new target. Use GetSubstring to extract the pieces of a matching line of the keywordfile. Thus, the path should be extracted (and converted to a URL), and the description should be extracted so that it can be printed separately.

For example, if you search for "Education Links" the above html file is a match. You should thus print something like the following. Notice that the description is printed first and then the URL.


K-12 Education Links     http://cis.stvincent.edu/html/ed.html

After completing this project, if you want another challenge, try rewriting the same project using the usual string type instead of the StringType that I created.

Related Items

Back to the main page for Software Design Using C++

Authors: Br. David Carlson and Br. Isidore Minerd
Last updated: April 23, 2013