Software Design Using C++
Web Search I (Simple Web Search in Linux)
This is the first of a two-part case study on web search. It is specific to Linux, and assumes that you have an account on a Linux web server, or that you at least have an account on a Linux computer where you can set up a bunch of files that imitate what might be present on a web server. You also need to have a C++ compiler on your Linux system. Our case study assumes that the compiler is g++, though others are likely to work as well.
Have you every wondered how the well-known Internet search engines work? They allow you to find documents scattered across the Internet that match your search conditions, and they generally do so very quickly. Creating a powerful search engine like this is beyond what we can do here, but as a start we can do some simple searching of documents on one web server. Perhaps you will be motivated to look further into search engines and someday work on the next generation of these!
A Simple Search
Let's begin with a pretty simple web search. It will only handle HTML files on your own web server. We will assume that we have (or can create) a list of the HTML files on this web server (or perhaps just those under some particular directory within the webroot). This list of HTML files is in a text file that has the complete pathname for one of these HTML files on each line. A really short example of such a file, which we will name weblist, is shown here:
You can use this sample weblist file if you wish, though it would be more interesting to use a weblist file containing real data for HTML files on your Linux web server. One way to do this is to manually edit your weblist file, typing in line by line the pathnames of various HTML files on your server. This would be too tedious for handling a lot of data, but would be reasonable for a sample weblist file that is only a few dozen lines long. If you want to process a fair amount of data, you might be able to use the following bash script:
To use this script, copy it into a new file on your Linux system. Name the file makelist and set its permissions to 700, which you can do with the command:
To run makelist, cd to the directory containing the makelist script, then run it as in this example:
The command-line parameter can be anything reasonable on your system. The /www/carlson/cs125 is probably useless on your server. The idea is that makelist produces the list of complete pathnames for all HTML files located anywhere under the folder given as the one parameter.
OK. We have our weblist file, as above. The actual search program will be a simple command-line program that prompts the user for the target string for which to look. The program then should do a sequential search through the weblist file and print all the lines where the target string was found within the line. It would be nicer to get a URL instead of a pathname, but we will worry about this in a later example.
Note that this will be a rather primitive search program in various ways. The user interface is clumsy. The search simply looks for a certain substring in the pathnames for the HTML files, which may or may not produce useful results. For example, in the sample weblist above, searching for cs125 would give a match on every line. However, searching for homework or for math would give fewer lines, and a search for syll would give just one line. Searching for view would yield just a few lines, and the results might not at all be what you want, since the filenames use the word review, which has a rather different meaning than the substring view. Note well that our search program does not look at the contents of the HTML files, though that could give a more helpful search program. The output to our simple search is also limited in that all we get is a list of filenames. There is no description of any of these files.
In spite of these shortcomings, let's continue. These limitations are deliberate as they allow us to create our first search program pretty quickly.
Before doing any actual coding, let's be sure that we have the program's specifications and design pretty well set. This is to be a command-line program (in Linux) that prompts the user for a target string. The program then does a sequential search of the text file weblist (with the format shown above) for all lines that contain the target string as a substring. All matching lines are printed as the output of the program. In addition, the weblist file is assumed to be in the same directory as the search program, so that it is easily available. Note that our program does just one search, though we could easily modify it to do repeated searches. The rough algorithm to do the sequential search through the weblist file is shown in the following pseudocode:
Since we will need to do a number of string operations, both in this and in later search programs, a collection of string functions is supplied for you in the following two files. This collection of functions works with C-style strings of type StringType. Such a string is just an array of StrMax characters. StrMax must be chosen to be large enough to handle the types of strings to be used in our search program. (We could have chosen to use type string, but for the current program let's use StringType.) We will not explain these functions in detail, but do read through the comments on each function so that you know what it does and how to use it. If you are interested in how these functions are implemented, read through the code inside of the functions carefully.
Now for the coding. The main program will just open the text file, call a search function, and then close the text file. This gives us the following:
The search function pretty much follows the pseudocode given above, except that the opening and closing of the weblist file are handled in the main function. The code for the search function is shown below. Notice how MyGetLine returns the number of characters it read. This can nicely be used to control the loop for the sequential search. Similarly, MyGetLine's return value gives you the length of the target string, so that there is no need to use the usual strlen function. If this length is not positive, we print an error message and then wish to stop the program. Since we want to be sure to close the file before ending the program, we do a return (not an exit) to get out of the search function. The main function then closes the file and ends normally.
The complete project file is given under the link below. Comments (including how to compile and run the program) are included).
Before attempting to run the simplesearch program, be sure that the executable permission is turned on. You can manually turn it on as follows:
Next, be sure that the weblist file is in the current directory, along with the simplesearch executable. Then use ./simplesearch at the command line to run the program. Hopefully your program will prompt you for a target string and then print the matching lines, as expected.
A Better Search
Let's modify our search program so that it is a little more useful. In particular, let's have it handle repeated searches instead of just one. We will have the user enter the empty string as the target when the user wishes to stop the program. Let's make the output more helpful, too, by giving the URL corresponding to each matching pathname instead of the pathname itself. For example, if your weblist file contains these 2 lines:
Then if you look up cs125, your output should include these lines:
Notice how the pathnames have been converted into URLs.
These are small modifications to our previous program. You may want to try these on your own and then compare your solution to the following:
There is a new function, PrintURL, which converts the pathname for a matching HTML file into a proper URL. It does this by printing the standard start for every URL on your web server and then skipping over the /www at the start of the pathname. Note that we use a FOR loop to print the pathname (without the initial /www) and that we print the characters one at a time by using the put function. Other ways of implementing the same functionality are, of course, possible. Read through the code to see the details.
How about a related project for you to try? Here is your challenge:
Change search.cpp so that it uses the keywordfile as on our Linux system. (Saint Vincent students should contact Br. David for the location and other information on this file. Other readers might manually construct a keywordfile that fits the format shown below. (It would be best to have one that matches some of the actual data on your Linux web server, but even fake data could be used). You might also be able to get your Linux system administrator to adapt and use the getmeta script to create a keywordfile of the proper type. This script requires a file called getmeta.subdirs whose contents must be adapted to indicate the subdirectories of webroot that you want to index on your web server. It also requires a compiled C utility. The source code is removespace.c and the directions for compiling it are contained within the source code file. Note that the getmeta script must be adjusted so that key variables are initialized to values that fit your web server situation. Even the scripting itself might need to be adjusted depending on the details of your version of Linux and other system-specific information. The compiled removespace program and the getmeta.subdirs file should be located in the same directory as the getmeta script.
In any case, this keywordfile contains on each line information about one html file. The format of each line is as in this example:
A starting /www is assumed. Thus the pathname for this html file is /www/html/ed.html The section between the | symbol and the first # is the description of this file. In our example, the description is: K-12 Education Links The strings between # symbols are the keywords (keyphrases) that can be used to look up this html file.
Look for the target string in the keywords section only. It is difficult to decide whether or not to allow a partial match. For example, in the above, should a target string of "Web" result in the display of this web page or should the complete keyphrase "Education Web Sites" be needed (or "Education Links", "K-12", or "Education"? If you do not allow partial matches, then a target of "Education Web Site" will not find this html page, as the final 's' is missing in the target. On the other hand allowing partial matches would find this page with a target of "ink".
So, let's decide this question for you: do NOT allow a partial keyword (or keyphrase) match. Insist on an exact match between the target string and a string between two # symbols. To do this, create a new target string that consists of a # symbol, the old target, and another #. Then use SubstringPresent to look for this new target. Use GetSubstring to extract the pieces of a matching line of the keywordfile. Thus, the path should be extracted (and converted to a URL), and the description should be extracted so that it can be printed separately.
For example, if you search for "Education Links" the above html file is a match. You should thus print something like the following. Notice that the description is printed first and then the URL.
After completing this project, if you want another challenge, try rewriting the same project using the usual string type instead of the StringType that I created.