Simple Online Quiz Technique — Part I

I have a pretty long list of “write a magazine article about this someday” items. But I could always use more, so if you want to see your name in print, please e-mail your ideas to me, and you’ll be appropriately credited! One item that’s been in there for nearly as long as I have been keeping a list is “show how to design an online quiz correctly so that people can’t cheat.” Why this? Well, far too often, I’ve seen “Web quiz” freeware that was all too trivial. The right answer was either guessable via staring at the mouseover URLs, or I could simply hit the “back” button and try a different answer if I got one wrong.

So, I gave it some thought and came up with a scheme that was simple and permitted me to generate random quizzes that prevented anyone from cheating. However, I hate to write content. If you have visited my Web site, you know the only content that really changes is the online archive of magazine articles I’ve written and the code that results from writing them. So, if I wanted to illustrate a quiz in an article, I needed to come up with content.

Luckily, the other day, someone on Void (a mailing list populated by twenty-something, new-media hackers from London, many of whom are big Perl fans) posed a “trivia” question that got me thinking. He quoted a paragraph from the screenit.com Web site. There, each week, nearly every new movie opening in the US is thoroughly reviewed, not only for artistic merit, but also in terms of “parental information” organized into 16 categories (such as “Alcohol/Drug Use” and “Frightening/Tense Scenes”). At the top of each screenit.com review is a scoreboard, but the details further down the page describe meticulously how appropriate a particular movie is for kids or for your big date with that special someone.

The “profanity” paragraph from one of the movies (listing all the words used in the film that a parent would be concerned with) was sent, and the challenge was made to guess the movie. Because the paragraph seemed to go on and on, a few guessed South Park, which probably takes the number three slot for excessive profanity. However, the correct answer was The Big Lebowski.

I got to thinking, “Here’s a lot of free content for an interesting quiz.” After all, their archives go back a few years, and the data format is regular (not quite regular enough though…as we’ll see later).

So I hacked a program to grab the data and another one to implement the quizzing architecture I had already sketched. This month, we’ll look at the data grabber; you’ll have to wait until next month to see the quizzer. Sorry, it’s too much for one month.

For the data grabber, I used Google to search for the movies. Google has this nice under-documented “site-only” search feature; by adding site:screenit.com to a query, I get hits only from that site. By adding “profanity” to the query, I get back a list of the pages that included that word at least once, a narrow-enough query that I got a lot of high-quality hits.

Of course, Google doles out the hits at most 100 at a time, so I had to repeat the query using an increasing starting point until I got all the hits it could give me. We’ll see how that works when we get to the code below.

Now for the cool stuff — part of the link that Google returns is a pointer to Google’s own cache for that page. So, rather than following the link back to screenit.com’s site and causing stress on their sometimes-overloaded server, I simply ask Google for its cached version! After all, if it’s not in Google’s cache, it won’t be returned in a search.

From that cached page, I look into the data to find the profanity paragraph and the movie title. The movie title is somewhat related to the screenit.com URL, but I didn’t rely on that, since some of them seemed to be arbitrary.

Finally, for permanent storage (my cache of their cache), I used a simple DBM database, accessible in Perl as a hash (a hash cache), allowing for easy programming and updates and queries. I had first considered using a MySQL database, but this turned out to be much easier.

The profanity-grabbing program is presented in Listing One (pg. 86).

Lines 1 through 3 begin nearly every program I write, enabling warnings, turning on the common compiler restrictions (mandatory for programs longer than 10 lines), and disabling the buffering of standard output.

Line 6 defines the only configuration constant for this program — the location of the database that this program and the quiz program share. This has to be in a directory that is accessible from CGI programs, although it should not be in a directory that is mapped to a URL. You wouldn’t want it to be that easy to cheat.

Lines 9 through 11 pull in the Web access modules (part of the LWP library) to fetch the data from Google.

Line 13 connects us to our DBM database. Yes, purists may point out that dbmopen is officially “deprecated,” but it’s still the most convenient interface for quick-and-dirty programs like this.
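For anyone who would rather avoid dbmopen, here is a minimal sketch of the same idea using tie. It assumes SDBM_File (which ships with core Perl) and uses a scratch directory instead of the real database path; both are choices made for this illustration, not something the listing does. Note that SDBM limits each key-plus-value pair to roughly 1K, so long paragraphs would call for a different DBM module such as DB_File or GDBM_File.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;                  # bundled with core Perl
use File::Temp qw(tempdir);

# Hypothetical location: a scratch directory stands in for the real
# database path so this sketch can run anywhere.
my $dir     = tempdir(CLEANUP => 1);
my $DATA_DB = "$dir/profanity_quiz";

# Same hash-style access as dbmopen, but the DBM flavor is explicit.
tie my %DATA, 'SDBM_File', $DATA_DB, O_RDWR | O_CREAT, 0644
  or die "Cannot open db: $!";

# Store one record in the format used by the listing.
$DATA{"www.screenit.com/movies/1997/gone_fishin.html"} =
  "GONE FISHIN'\n<DL>many\nlines\n</DL>\n";

my $count = keys %DATA;
print "$count total movies for the quiz!\n";

untie %DATA;                    # flush and close the database files
```

Because the hash interface is identical either way, the rest of the program would not need to change.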

Lines 15 through 18 remind me of the format of the database. I picked this format out of sheer laziness, not because of any design. The key is the URL minus the scheme, because that’s how Google reports it for an internal link. The value consists of the real name of the movie, followed by a newline and then the paragraph of the profanity information. I didn’t do any cleanup on the paragraphs, so most of them still contain an embedded “A HREF=…” element. Please remember that laziness is a Perl virtue.
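To make that record format concrete, here is a small sketch of packing and unpacking a value. The helper names pack_record and unpack_record are made up for this illustration; the listing simply interpolates the two strings.

```perl
use strict;
use warnings;

# Pack a title and its paragraph into the single-string record format.
sub pack_record {
    my ($title, $paragraph) = @_;
    return "$title\n$paragraph";
}

# Split a record back apart. The limit of 2 splits only at the first
# newline, so the newlines inside the paragraph are left alone.
sub unpack_record {
    my ($record) = @_;
    my ($title, $paragraph) = split /\n/, $record, 2;
    return ($title, $paragraph);
}

my $record = pack_record("GONE FISHIN'", "<DL>many\nlines\n</DL>\n");
my ($title, $paragraph) = unpack_record($record);
print "title: $title\n";
print "paragraph:\n$paragraph";
```

The one thing this format relies on is that a title never contains a newline, which holds as long as titles come from a single-line pattern match.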

Line 20 creates the “user agent” object, acting as a tiny browser to view the Web. If I had needed to set up a particular configuration, such as Web proxy information or a particular user agent, I’d have done that here too.

Line 22 starts the “outer loop.” We’ll loop once for each index page from Google, stepping through result hits 100 at a time.

Lines 24 through 26 set up the Google URL for this particular query. We’ll ask for links into screenit.com that contain the word “profanity,” 100 at a time, starting at the link numbered in $start, and disable the “similar pages” filter. These parameters were reverse-engineered by typing those queries directly into Google and watching the resulting links being provided. So if Google changes format, this breaks. Such is the way of screen scraping.

Lines 27 and 28 ask Google for the response, aborting the outer loop if it breaks.

Lines 31 and 32 look within the Google response for links to the cache, which means we’ve stumbled across some pages that Google has seen at screenit.com. If none, again, we simply abort the outer loop.
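The shape of that outer loop is easier to see with the network taken out. The sketch below keeps the open-ended for loop and the “stop on an empty page” test, but replaces the Google request with a hypothetical fetch_index_page stub serving 250 made-up URLs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real Google request: returns the hits
# for one index page, or an empty list once the results run out.
my @all_hits = map { "www.screenit.com/movies/1999/movie_$_.html" } 1 .. 250;

sub fetch_index_page {
    my ($start, $num) = @_;
    return () if $start > $#all_hits;
    my $end = $start + $num - 1;
    $end = $#all_hits if $end > $#all_hits;
    return @all_hits[$start .. $end];
}

# Step through the results one page at a time. The loop has no end
# condition of its own; an empty page is what stops it.
sub collect_all_hits {
    my ($page_size) = @_;
    my @collected;
    for (my $start = 0; ; $start += $page_size) {
        my @urls = fetch_index_page($start, $page_size);
        last unless @urls;
        push @collected, @urls;
    }
    return @collected;
}

my @hits = collect_all_hits(100);
print scalar(@hits), " hits collected\n";
```

With 250 hits and a page size of 100, the stub is called four times: three pages with data and a final empty one that ends the loop.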

Line 35 begins the “inner loop,” cycling once for each URL in Google’s cache, which should be a page at screenit.com containing the word “profanity.”

Lines 36 through 39 skip any URL that doesn’t look like a movie. Again, this was determined by looking at the returned links and noting that screenit.com also rates music CDs and other things. Movies are all displayed as title/year (where year is a four-digit number), so we ignore anything not noted in such a manner.

Lines 40 to 43 skip over any URL for which we’ve already got good data. This permits us to rerun this program quickly (as often as once a day) without spending a lot of time re-fetching data we’ve already extracted. This step is why the DBM is keyed by the URL rather than the movie title; we want this lookup to be very fast.
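Those two skip tests can be folded into a single filter, sketched below. The urls_to_fetch helper and the two example URLs marked as hypothetical are inventions for this illustration; the listing performs the same checks inline with next.

```perl
use strict;
use warnings;

# Return the URLs still worth fetching: movie pages (a path containing
# movies/ and a four-digit year) that are not already in the database.
sub urls_to_fetch {
    my ($known, @urls) = @_;
    return grep { m{movies/\d\d\d\d/} && !$known->{$_} } @urls;
}

# A plain hash stands in for the DBM-backed %DATA here.
my %known = (
    "www.screenit.com/movies/1997/gone_fishin.html" =>
        "GONE FISHIN'\n<DL>many\nlines\n</DL>\n",
);

my @candidates = (
    "www.screenit.com/movies/1997/gone_fishin.html",    # already stored
    "www.screenit.com/movies/1998/example_movie.html",  # hypothetical, new
    "www.screenit.com/music/example_album.html",        # hypothetical, not a movie
);

my @todo = urls_to_fetch(\%known, @candidates);
print "$_\n" for @todo;
```

Only the second candidate survives: the first is already in the hash and the third does not look like a movie page.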

Lines 46 to 52 “follow” the cache URL link, grabbing the data from Google’s cache. Note that at no time are we actually touching screenit.com’s Web site; we’re simply making a cache of a cache. I think this is important in terms of being a good neighbor of such a valuable service. If we can’t get the cache, we simply ignore it.

Lines 54 through 70 locate the “profanity” paragraph within the response. I’m using an “extended” regular expression here, so the white space within the regular expression spanning lines 56 to 65 is simply ignored. Again, the pattern here came from staring at enough entries to figure out the possible range. Even so, there are a few movies that don’t match properly, but I’ve got enough for a good quiz. The pattern puts the profanity paragraph in the $1 match variable, which is saved in $prof in line 70.
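One payoff of the extended syntax is that each piece of a pattern can carry its own comment. The sketch below applies a pattern of the same shape to a made-up page fragment; the HTML and the find_profanity helper are assumptions for illustration, and real pages are messier.

```perl
use strict;
use warnings;

# Hypothetical page fragment, one array element per line.
my $html = join "\n",
    '<html><body>',
    '',
    '<dl>',
    '<dt><a href="#p">PROFANITY</a>',
    '<dd>Sample paragraph text.',
    '</dl>',
    '',
    '</body></html>',
    '';

# Return the whole definition list that holds the profanity label,
# or nothing if the page does not fit the expected layout.
sub find_profanity {
    my ($content) = @_;
    return unless $content =~ m{
        \n                  # a newline, so the list starts its own line
        (                   # capture everything up to the closing paren
          <dl>              #   start of the definition list
          .*?               #   rest of that line, if anything
          (?:\n.*?)??       #   at most one more line before the label
          profanity </a>\n  #   the label, ending its link and its line
          (?:.+\n)*?        #   as few non-empty lines as possible
          </dl>\n           #   end of the list
        )
        \n                  # an empty line after the list
    }ix;
    return $1;
}

my $prof = find_profanity($html);
print defined $prof ? $prof : "no match\n";
```

Without the /s modifier, the dot never crosses a newline, which is why the pattern spells out each line break explicitly.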

Similarly, lines 73 to 86 find the movie title by extracting the HTML page title. Some of the titles have a preceding label, optionally matched in lines 75 to 77. Again, this came from running the program a half-dozen times, constantly tweaking what was and wasn’t matching. Line 87 dumps the title to standard output as an indication that we got good matching data on this movie, including its title.
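Here is a small sketch of the optional-label idea in isolation. The find_title helper and the first sample title (including the word “Parental”) are assumptions made up for this illustration, not data taken from the site.

```perl
use strict;
use warnings;

# Pull the movie name out of the page title, dropping the site label
# when one is present.
sub find_title {
    my ($content) = @_;
    return unless $content =~ m{
        <title>
        (?:                                       # optional site label
          SCREEN \s+ IT! \s+ \S+ \s+ REVIEW: \s+
        )?
        ( .+ )                                    # the movie title itself
        </title>
    }ix;
    return $1;
}

my $labeled = find_title("<title>Screen It! Parental Review: The Big Lebowski</title>");
my $plain   = find_title("<title>GONE FISHIN'</title>");
print "$labeled\n$plain\n";
```

Because the label group is optional, a title without it still matches, and the capture then holds the whole title text.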

Line 90 stores the data in the database through a simple DBM assignment.

Line 94 reports the total number of movies in the quiz database, which is currently 933 movies, plenty for this quiz.

So, we plop this into a file, adjust the one configuration parameter, and run it. Out pops a database of movies and their screenit.com profanity paragraphs, and thanks to our friends at Google, without ever having had to touch screenit.com. Next month, we’ll examine how to build a CGI quiz script using this database. Until then, watch a movie or two at the theatre or rent a DVD. Enjoy!




Listing One: Data Grabber — Part I


 1 #!/usr/bin/perl -w
 2 use strict;
 3 $|++;
 4
 5 ## config
 6 my $DATA_DB = "/home/merlyn/Web/profanity_quiz";
 7 ## end config
 8
 9 use LWP::UserAgent;
10 use HTTP::Request::Common;
11 use URI;
12
13 dbmopen my %DATA, $DATA_DB, 0644 or die "Cannot open db: $!";
14
15 ## %DATA format: for each movie, keyed by partial URL,
16 ## value is "$title\n$profanity_paragraph_with_newlines", as in:
17 ## $DATA{"www.screenit.com/movies/1997/gone_fishin.html"} =
18 ##   "GONE FISHIN'\n<DL>many\nlines\n</DL>\n";
19
20 my $ua = LWP::UserAgent->new;
21
22 for (my $start = 0; ; $start += 100) {
23   ## fetch each index page:
24   my $uri = URI->new("http://www.google.com/search");
25   $uri->query_form('q' => "site:screenit.com profanity",
26                    'num' => 100, 'start' => $start, 'filter' => 0);
27   my $response = $ua->simple_request(GET $uri);
28   last unless $response->is_success;
29
30   ## parse the index page looking for links to movie pages in cache:
31   my @urls = $response->content =~ m{A HREF=/search\?q=cache:(.*?)\+}g;
32   last unless @urls;
33
34   ## fetch each cached movie page if it fits the profile:
35   for my $url (@urls) {
36     unless ($url =~ m{movies/\d\d\d\d/}) {
37       print "skipping $url\n";
38       next;
39     }
40     if ($DATA{$url}) {
41       print "skipping $url because we have it\n";
42       next;
43     }
44
45     ## get cached movie page from cache:
46     $uri->query_form('q' => "cache:$url");
47     my $res = $ua->simple_request(GET $uri);
48     print $uri, " ==>\n";
49     unless ($res->is_success) {
50       print "___ FAILURE___\n", $res->as_string, "______\n";
51       next;
52     }
53
54     ## look for profanity paragraph:
55     unless ($res->content =~ m{
56                                \n
57                                (
58                                 <dl>
59                                 .*?
60                                 (?:\n.*?)??
61                                 profanity </a>\n
62                                 (?:.+\n)*?
63                                 </dl>\n
64                                )
65                                \n
66                               }ix) {
67       print "can't find profanity DL in\n", $res->content;
68       next;
69     }
70     my $prof = $1;
71
72     ## look for title:
73     unless ($res->content =~ m{
74                                <title>
75                                (?:
76                                 SCREEN \s+ IT! \s+ \S+ \s+ REVIEW: \s+
77                                )?
78                                (
79                                 .+
80                                )
81                                </title>
82                               }ix) {
83       print "can't find title in\n", $res->content;
84       next;
85     }
86     my $title = $1;
87     print "... $title\n"; # for tracing
88
89     ## save data:
90     $DATA{$url} = "$title\n$prof";
91   }
92 }
93
94 print scalar keys %DATA, " total movies for the quiz!\n";



Randal L. Schwartz is the chief Perl guru at Stonehenge Consulting and co-author of Learning Perl and Programming Perl. He can be reached at merlyn@stonehenge.com. Code listings for this column can be found at: http://www.stonehenge.com/merlyn/LinuxMag/.