The Extreme Searcher’s Internet Handbook P2

1979
The first Usenet discussion groups are created by Tom Truscott, Jim Ellis, and Steve Bellovin, graduate students at Duke University and the University of North Carolina. Usenet quickly spreads worldwide.
The first emoticons (smileys) are suggested by Kevin MacKenzie.

1980s
The personal computer becomes a part of millions of people’s lives.
There are 213 hosts on ARPANET.
BITNET (Because It’s Time Network) is started, providing e-mail, electronic mailing lists, and FTP service.
CSNET (Computer Science Network) is created by computer scientists at Purdue University, the University of Washington, RAND Corporation, and BBN, with National Science Foundation (NSF) support. It provides e-mail and other networking services to researchers who did not have access to ARPANET.

1982
The term “Internet” is first used.
TCP/IP is adopted as the universal protocol for the Internet.
Name servers are developed, allowing a user to get to a computer without specifying the exact path.
There are 562 hosts on the Internet.
France Telecom begins distributing Minitel terminals to subscribers free of charge, providing videotext access to the Teletel system. Initially providing telephone directory lookups, then chat and other services, Teletel is the first widespread home implementation of these types of network services.

1984
Orwell’s vision, fortunately, is not fulfilled, but computers are soon to be in almost every home.
There are over 1,000 hosts on the Internet.

1985
The WELL (Whole Earth ‘Lectronic Link) is started. Individual users, outside of universities, can now easily participate on the Internet.
There are over 5,000 hosts on the Internet.

1986
NSFNET (National Science Foundation Network) is created. The backbone speed is 56K. (Yes, as in the total transmission capability of a 56K dial-up modem.)

1987
There are over 10,000 hosts on the Internet.
BASICS FOR THE SERIOUS SEARCHER

1988
The NSFNET backbone is upgraded to a T1 at 1.544 Mbps (megabits per second).

1989
There are over 100,000 hosts on the Internet.

1990
ARPANET goes away.
There are over 300,000 hosts on the Internet.

1991
Tim Berners-Lee at CERN (Conseil Européen pour la Recherche Nucléaire) in Geneva introduces the World Wide Web.
NSF removes the restriction on commercial use of the Internet.
The first gopher, which allows point-and-click access to files on remote computers, is released at the University of Minnesota.
The NSFNET backbone is upgraded to a T3 (44.736 Mbps).

1992
There are over 1,000,000 hosts on the Internet.
Jean Armour Polly coins the phrase “surfing the Internet.”

1994
The first graphics-based browser, Mosaic, is released.
Internet talk radio begins.
WebCrawler, the first successful Web search engine, is introduced.
A law firm introduces Internet “spam.”
Netscape Navigator, the commercial version of Mosaic, is shipped.

1995
NSFNET reverts to being a research network. Internet infrastructure is now primarily provided by commercial firms.
RealAudio is introduced, meaning that you no longer have to wait for sound files to download completely before you begin hearing them, and allowing for continued (“streaming”) downloads.
Consumer services such as CompuServe, America Online, and Prodigy begin to provide access through the Internet instead of only through their private dial-up networks.

1996
There are over 10,000,000 hosts on the Internet.

1999
Microsoft’s Internet Explorer overtakes Netscape as the most popular browser.
Testing of the registration of domain names in Chinese, Japanese, and Korean languages begins, reflective of the internationalization of Internet usage.

2001
Mysterious monolith does not emerge from the Earth and no evil computers take over any spaceships (as far as we know).

2002
Google is indexing more than 3 billion Web pages.

2003
There are more than 200,000,000 hosts on the Internet.

Internet History Resources

Anyone interested in information on the history of the Internet beyond this selective list is encouraged to consult the following resources.

A Brief History of the Internet, version 3.1
http://www.isoc.org/internet-history
By Barry M. Leiner, Vinton G. Cerf, David D. Clark, Robert E. Kahn, Leonard Kleinrock, Daniel C. Lynch, Jon Postel, Larry G. Roberts, Stephen Wolff. This site provides historical commentary from many of the actual people who were involved in the creation of the Internet.

Internet History and Growth
http://www.isoc.org/internet/history/2002_0918_Internet_History_and_Growth.ppt
By William F. Slater. This PowerPoint presentation provides a good look at the pioneers of the Internet and an excellent collection of statistics on Internet growth.

Hobbes’ Internet Timeline
http://www.zakon.org/robert/internet/timeline
This detailed timeline emphasizes technical developments and who was behind them.

SEARCHING THE INTERNET: WEB “FINDING TOOLS”

Whether your hobby or profession is cooking, carpentry, chemistry, or anything in between, you know that the right tool can make all the difference. The same is true for searching the Web. A variety of tools are available to help you find what you need, and each does things a little differently, sometimes with different purposes and different emphases, as well as different coverage and different search features.

To understand the variety of tools, it can be helpful to think of most finding tools as falling into one of three categories (although many tools will be hybrids). These three categories of tools are (1) general directories, (2) search engines, and (3) specialized directories. The third category could indeed be lumped in with the first because both are directories, but for a couple of reasons discussed later, it is worthwhile to separate them.

All three of these categories may incorporate another function, that of a portal, a Web site that provides a gateway not only to links, but to a number of other information resources going beyond just the searching or browsing function. These resources may include news headlines, weather, professional directories, stock market information, a glossary, alerts, and other kinds of handy information. A portal can be general, as in the case of Yahoo!’s My Yahoo!, or it can be specific for a particular discipline, region, or country.

Other finding tools serve other kinds of Internet content, such as newsgroups, mailing lists, images, and audio. These tools may exist on sites of their own or they may be incorporated into the three main categories of tools. These specialized tools will be covered in later chapters.

General Web Directories

General Web directories, such as Yahoo!, Open Directory, and LookSmart, are Web sites that provide a large collection of links arranged in categories to enable browsing by subject area. Their content is (usually) hand picked by human beings who ask the question: “Is this site of enough interest to enough people that it should be included in the directory?” If the answer is yes (and in some cases, if the owner of the site has paid a fee), the site is added to the directory’s database (catalog) and is listed in one or more of the subject categories. As a result of this process, these tools have two major characteristics: They are selective (sites have had to meet the selection criteria), and they are categorized (all sites are arranged in categories; see Figure 1.1).

Because of the selectivity, the user of these directories is working, theoretically, with higher quality sites: the wheat and not the chaff.
Because the sites included are arranged in categories, the user has the option of starting at the top of the hierarchy of categories and browsing down until the appropriate level of specificity is reached. Also, usually only one entry is made for each site, instead of including, as in search engines, many pages from the same site.

The size of the database of general Web directories is much smaller than that created and used by Web search engines, the former containing usually 2 to 3 million sites and the latter from 1 to 3 billion pages. Web directories are designed primarily for browsing and for general questions. Sites on very specific topics, such as “UV-enhanced dry stripping of silicon nitride films” or “social security retirement program reform in Croatia,” are generally not included. As a result, directories are most successfully used for general, rather than specific, questions, for example, “Types of Chemical Reactions” or “social security.” Although browsing through the categories is the major design idea behind general Web directories, they do provide a search box to allow you to bypass the browsing and go directly to the sites in the database.

Figure 1.1 Yahoo!’s Main Directory Page

When to Use a General Directory

General Web directories are a good starting place when you have a very general question (museums in Paris, dyslexia), or when you don’t quite know where to go with a broad topic and would like to browse down through a category to get some guidance.

General Web directories are discussed in detail in Chapter 2.

TIP: If your question contains one or two concepts, consider a directory. If it contains three or more, definitely start with a search engine.

Web Search Engines

Whereas a directory is a good start when you want to be directed to just a few selected items on a fairly general topic, search engines are the place to go when you want something on a fairly specific topic (ethics of human cloning, Italian paintings of William Stanley Haseltine). Instead of searching brief descriptions of 2 to 3 million Web sites, these services allow you to search virtually every word from 2 to 3 billion Web pages. In addition, Web search engines support much more sophisticated techniques, allowing you to focus in on your topic much more effectively. The pages included in Web search engines are not placed in categories (hence, you cannot browse a hierarchy), and no prior human selectivity was involved in determining what is in the search engine’s database. You, as the searcher, provide the selectivity by the search terms you choose and by the further narrowing techniques you may apply.

When to Use Search Engines

If your topic is very specific or you expect that very little is written on it, a search engine will be a much better starting place than a directory. If you need to be exhaustive, use a search engine. If your topic is a combination of three or more concepts (e.g., “Italian” “paintings” “Haseltine”), use a search engine. (See Chapter 4 for more details on search engines.)

Figure 1.2 Web Search Engine: AllTheWeb’s Advanced Search Page

Specialized Directories (Resource Guides, Research Guides, Metasites)

Specialized Web directories are collections of selected Internet resources (collections of links) on a particular topic. The topic could range from something as broad as medicine to something as specific as biomechanics. These sites go by a variety of names such as resource guides, research guides, metasites, cyberguides, and webliographies.
Although their main function is to provide links to resources, they often also incorporate some additional portal features such as news headlines.

Indeed, this category could have been lumped in with the general Web directories, but it is kept separate for two main reasons. First, the large general directories, such as Yahoo! and Open Directory, all have a number of things in common besides being general. They all provide categories you can browse, they all also have a search feature, and when you get to know them, they all tend to have the same “look and feel” in other ways as well. The second main reason for keeping the specialized directories as a separate category is that they deserve greater attention than they often get. More searchers need to tap into their extensive utility.

When to Use Specialized Directories

Use specialized directories when you need to get to know the Web literature on a topic, in other words, when you need a general familiarity with the major resources for a particular discipline or a particular area of study. These sites can be thought of as providing some immediate expertise in using Web resources in the area of interest. Also, when you are not sure how to narrow your topic and would like to browse, these sites can often be better starting places than a general directory because they may reflect a greater expertise in the choice of resources for a particular area than would a general directory, and they often include more sites on the specific topic than are found in the corresponding section of a general directory.

Specialized directories are discussed in detail in Chapter 3.

GENERAL STRATEGIES

First, there is no right or wrong way to search the Internet. If you find what you need and find it quickly, your strategy is good. Keep in mind, though, that finding what you need involves questions such as “Was it really the correct answer?”, “Was it the best answer?”, and “Was it the complete answer?”

At the broadest level, assuming that your question is one for which the Internet is the best starting place, one approach to finding what you need on the Internet is to first answer the following three questions.

1. Exactly what is my question? (Identification of what you really need and how exhaustive or precise you need to be.)
2. What is the most appropriate tool with which to start? (See the previous sections on the categories of finding tools.)
3. What search strategy should I start with?

These three steps often take place without much conscious effort and may take a matter of seconds. For instance, if you want to find out who General Carl Schurz was, you go to your favorite search engine and throw in those three words. The quick-and-easy, keep-it-simple approach is often the best. Even for a more complicated question, it is often worthwhile to start with a very simple approach in order to get a sense of what is out there, then develop a more sophisticated strategy based on an analysis of your topic into concepts.

Organizing Your Search by Concepts

Thinking in terms of concepts is both a natural way of organizing the world around us and a way of organizing your thoughts about a search. Thinking in concepts is a central part of most searches. The concepts are the ideas that must be present in order for a resultant answer to be relevant, each concept corresponding to a required criterion. Sometimes a search is so specific that only a single concept is involved, but most searches involve a combination of two, three, or four concepts. For instance, if our search is for “hotels in Albuquerque,” our two concepts are “hotels” and “Albuquerque.” If we are trying to identify Web pages on this topic, any Web page that includes both concepts possibly contains what we are looking for, and any page that is missing either of those concepts is not going to be relevant.
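
The two-concept test described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function name page_matches_concepts and the three sample page texts are hypothetical, and plain case-insensitive substring matching stands in for whatever a real search engine does internally.

```python
def page_matches_concepts(page_text, concepts):
    """Return True only if every concept term appears in the page text.

    Each concept is treated as a required criterion, so a page that is
    missing any one of them is rejected.
    """
    text = page_text.lower()
    return all(concept.lower() in text for concept in concepts)


# Hypothetical page texts, invented purely for illustration.
pages = {
    "page_a": "A guide to hotels and motels in Albuquerque, New Mexico.",
    "page_b": "The best hotels in Santa Fe.",
    "page_c": "Albuquerque hosts a balloon fiesta every October.",
}

concepts = ["hotels", "Albuquerque"]

# Keep only the pages on which both concepts are present.
relevant = [name for name, text in pages.items()
            if page_matches_concepts(text, concepts)]
print(relevant)
```

Because every concept is required, adding a third concept to the list can only keep the result set the same size or make it smaller.
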

The experienced searcher knows that for any concept, more than one term present in a record (on a Web page) may indicate the presence of the concept, and these alternate terms also need to be considered. Alternate terms may include, among other things, (1) grammatical variations (e.g., electricity, electrical), (2) synonyms, near-synonyms, or closely related terms (e.g., culture, traditions), and (3) a term and its narrower terms. For an exhaustive search in which “Baltic states” is a concept, you may want to also search for Latvia, Lithuania, and Estonia. In an exhaustive search for information on the production of electricity in the Baltic states, you would not want to miss that Web page that dealt specifically with “Production of Electricity in Latvia.”

When the idea of thinking in concepts is expanded further, it naturally leads to a discussion of Boolean logic, which will be covered in Chapter 4. In the meantime, the major point here is that, in preparing your search strategy, think about what concepts are involved, and remember that, for most concepts, looking for alternate terms is important.

A BASIC COLLECTION OF STRATEGIES

Just as there is no one right or wrong way to search the Internet, there can be no list of definitive steps to follow, or one specific strategy to follow, in preparing and performing every search. Rather, it is useful to think in terms of a toolbox of strategies and to select whichever tool or combination of tools seems most appropriate for the search at hand. Among the more common strategies, or strategic tools, or approaches for searching the Internet are the following:

1. Identify your basic ideas (concepts) and rely on the built-in relevance ranking provided by search engines.
In the major search engines and many other search sites, when you enter terms, only those records (Web pages) that contain all those terms will be retrieved, and the engine will automatically rank the order of output based on various criteria.
Figure 1.3 Ranked Output
2. Use simple narrowing techniques if your results need narrowing:
   • Add another concept to narrow your search (instead of hotels Albuquerque, try inexpensive hotels Albuquerque).
   • Use quotation marks to indicate phrases when a phrase more exactly defines your concept(s) than if the words occur in different places on the page, for example, “foreign policy.” Most Web sites that have a search function allow you to specify a phrase (a combination of two or more adjacent words, in the order written) by the use of quotation marks.
   • Use a more specific term for one or more of your concepts (instead of intelligence, perhaps use military intelligence).
   • Narrow your results to only those items that contain your most important terms in the title of the page. (These kinds of techniques will be discussed in Chapter 4.)
3. Examine your first results and look for, then use, terms you might not have thought of at first.
4. If you do not seem to be getting enough relevant items, use the Boolean OR operation to allow for alternate terms, for example, electrical OR electricity would find all items that have either the term electrical or the term electricity. How you express the OR operation varies with the finding tool.
5. Use a combination of Boolean operations (AND, OR, NOT, or their equivalents) to identify those pages that contain a specific combination of concepts and alternate terms for those concepts (for example, to get all pages that contain either the term cloth or the term fabric and also contain the words flax and shrinkage).
As will be discussed later, Boolean is not necessarily complicated, is often implied without you doing anything, and can be as simple as choosing between “all of these words” or “any of these words” options.
6. Look at what else the finding tools (particularly search engines) can do to allow you to get as much as you need, and only what you need. Advanced search pages are probably the first place you should look.

Ask five different experienced searchers and you will get five different lists of strategies. The most important thing is to have an awareness of the kinds of techniques that are available to you for getting everything you need and, at the same time, only what you need.

CONTENT ON THE INTERNET

Not only the amount of information but the kinds of information available and searchable on the Internet continue to increase rapidly. Understanding what you are getting, and not getting, as a result of a search of the Internet requires consideration of a number of factors, such as the time frames covered, quality of content, and a recognition that various kinds of material exist on the Internet that are not readily accessible by search engines. In using the content found on the Internet, other issues must also be considered, such as copyright.

Assessing Quality of Content

A favorite complaint by those who are still a bit shy of the Internet is that the quality of information found there is often low. The same could be said about information available from a lot of other resources. A newsstand may have both The Economist and The National Enquirer on its shelves. On television you will find both The History Channel and infomercials. Experience has taught us how, in most cases, to make a quick determination of the relative quality of the information we encounter in our daily lives. In using the Internet, many of the same criteria can be successfully applied, particularly those criteria we are accustomed to applying to traditional literature resources, both popular and academic.

TIP: For most sites, if you don’t immediately see how to get back to the home page, try clicking on the site’s logo. It usually works.

The traditional literature evaluation criteria that can be applied in the Internet context include:

1. Consider the source.
From what organization does the content originate? Look for the organization identified both on the Web page itself and at the URL. Is the content identified as coming from known sources such as a news organization, a government, an academic journal, a professional association, or a major investment firm? Just because it does not come from such a source is certainly not cause enough to reject it outright. On the other hand, even if it does come from such a source, don’t bet the farm on this criterion alone.
Look at the URL. Often you will immediately be able to identify the owner. Peel back the URL to the domain name. If that does not adequately identify it, you can check details of the domain ownership for U.S. sites on sites that provide access to the Whois database, such as Network Solutions’ (VeriSign) http://www.networksolutions.com/cgi-bin/whois/whois. For other countries, similar sites are available. Be aware that some look-alike domain names are intended to fool the reader as to the origin of the site.
The top level domain (edu, com, etc.) may provide some clues about the source of the information, but do not make too many assumptions here. An edu or ac domain does not necessarily assure academic content, given that students as well as faculty can often easily get a space on the university server. A tilde (~) in a directory name is often an indication of a personal page. Again, don’t reject something on such a criterion alone. There are some very valuable personal pages out there.
Is the actual author identified?
Is there an indication of the author’s credentials, the author’s organization? Do a search for other things by the same author. Does she or he publish a lot on spontaneous human combustion and extraterrestrial origins of life on earth? If you recognize an author’s name and the work does not seem consistent with other things from the same author, question it. It is easy to impersonate someone on the Internet.

2. Consider the motivation.
What seems to be the purpose of the site: academic, consumer protection, sales, entertainment (don’t be taken in by a spoof), political? There is, of course, nothing inherently bad (or for that matter necessarily inherently good) in any of those purposes, but identifying the motivation can be helpful in assessing the degree of objectivity. Is any advertising on the page clearly identified, or is advertising disguised as something else?

3. Look at the quality of the writing.
If there are spelling and grammatical errors, assume that the same level of attention to detail probably went into the gathering and reporting of the “facts” given on the site.

4. Look at the quality of the documentation of sources cited.
First, remember that even in academic circles, the number of footnotes is not a true measure of the quality of a work. On the other hand, and more importantly, if facts are cited, does the page identify the origin of the facts? If a lot rests on the information you are gathering, check out some of the cited sources to see that they really do give the facts that were quoted.

5. Are the site and its contents as current as they should be?
If a site is reporting on current events, the need for currency and the answer to the question of currency will be apparent. If the content is something that should be up-to-date, look for indications of timeliness, such as a “last updated” date on the page or telling examples of outdated material.
If, for example, it is a site that recommends which search engines to use, and if WebCrawler is still listed, don’t trust the currency (or for that matter, accuracy) of other things on the page. What is the most recent material that is referred to? If a number of links are “dead links,” assume that the author of the page is not giving it much attention.

6. For facts you are going to use, verify using multiple sources, or choose the most authoritative source.
Unfortunately, many facts given on Web pages are simply wrong, from carelessness, exaggeration, guessing, or for other reasons. Often they are wrong because the person creating that page’s content did not check the facts. If you need a specific fact, such as the date of an historic event, look for more than one Web page that gives the date and see if they agree. Also remember that one Web site may be more authoritative than another. If you have a quotation in hand and want to find who said it, you might want to go to a source such as Bartleby.com (which includes very respected quotations sources), instead of taking the answer from Web pages of lesser-known origins.

For more details and other ideas on the topic of evaluating the quality of information found on the Internet, the following two resources will be useful.

The Virtual Chase: Evaluating the Quality of Information on the Internet
http://www.virtualchase.com/quality
Created and maintained by Genie Tyburski, this site provides an excellent overview of the factors and issues to consider when evaluating the quality of information found on a Web site. She provides checklists and links to other checklists as well as examples of sites that demonstrate both good and bad qualities.

Evaluating the Quality of World Wide Web Resources
http://www.valpo.edu/library/evaluation.html
This site from Valparaiso University provides a detailed set of criteria and also several dozen links to other sites that address the topic of evaluating Web resources.
It also has links to exercises and worksheets on the topic.

Retrospective Coverage of Content

It is tempting to say that a major weakness of Internet content is lack of retrospective coverage. This is certainly an issue for which the serious user should have a high level of awareness. It is also an issue that should be put in perspective. The importance and amount of relevant retrospective coverage available depends on the kind of information you are seeking at any particular moment, and on your particular question.

It is safe to say that no Web pages on the Internet were created before 1991.

Books, Ancient Writings, and Historical Documents

The lack of pre-1991 Web pages does not mean that earlier content is not available. Indeed, if a work is moderately well-known and was written before 1920 or so, you are as likely to find it on the Internet as in a small local public library. Take a look at the list of works included in the Project Gutenberg site and The Online Books Page (see Chapter 6), where you will find works of Cicero, Balzac, Heine, Disraeli, Einstein, and thousands of other authors. Also look at some of the other Web sites discussed in Chapter 6 for sources of historical documents.

Scholarly and Technical Journals and Popular Magazines

If you are looking for the full text of journal or magazine articles written several years ago, you are not likely to find them free on the Internet (and, for most journal articles, you are not even likely to find the ones written this week, last month, or last year). This lack of content is more a function of copyright and requirements for paid subscriptions than a matter of the retrospective aspect. The distinction also needs to be made here between free material and “for fee” material on the Internet. On a number of sources on the Internet (such as ingenta) you can find references to scholarly and other material going back several years.
Most likely you will need to pay to see the full text, but fees tend to be very reasonable.

Whatever source you use for serious research, Internet or other, examine the source to see how far back it goes.

Newspapers and Other News Sources

If, when you speak of news, you think of “new news,” retrospective coverage is not an issue. If you are looking for newspaper or other articles that go back more than a few days, the time span of available content on any particular site is crucial. In 2000, many newspapers on the Internet contained only the current day’s stories, with a few having up to a year or two of stories. Fortunately, more and more newspaper and other news sites are archiving their material, and you may find several years of content on the site. Look closely at the site to see exactly how far back the site goes.

Old Web Pages

A different aspect of the retrospective issue centers on the fact that many Web pages change frequently and many simply go away. Pages that existed in the early 1990s are likely to either be gone or have different content than they did then. This becomes a significant problem when trying to track down early content or citing early content. Fortunately, there are at least partial solutions to the problem. For very recent pages that may have disappeared or changed in the last few days or weeks, Google’s “cache” option may help. For Web pages in Google’s database, Google has stored a copy. If you find the reference to the page in Google, but when you try to go to it, the page is either completely gone, or the content that you expected to find on the page is no longer there, click on the “Cached” option and you will get to a copy of the page as it was when Google last indexed it. Even if you initially found the page elsewhere, search for it in Google, and if you find it there, try the cache.

For locating earlier pages and their content, try the Wayback Machine.

Wayback Machine (Internet Archive)
http://www.archive.org
The Wayback Machine provides access to the Internet Archive, which has the purpose of “offering permanent access for researchers, historians, and scholars to historical collections that exist in digital format.” It allows you to search over 10 billion pages and see what a particular page looked like at various periods in Internet time. A search yields a list of what pages are available for what dates as far back as 1996. (See Figure 1.4.) As well as Web pages, it also archives moving images, texts, and audio. Its producers claim it is the largest database ever built.

Figure 1.4 Wayback Machine Search Result Showing Pages Available in the Internet Archive for whitehouse.gov

CONTENT: THE INVISIBLE WEB

No matter how good you are at using Web search engines and general directories, there are valuable resources on the Web that search engines will not find for you. You can get to most of them if you know the URL, but a search engine search will probably not find them for you. These resources, often referred to as the “Invisible Web,” include a variety of content, including, most importantly, databases of articles, data, statistics, and government documents. The “invisible” refers to “invisible to search engines.” There is nothing mysterious or mystical involved.

The Invisible Web is important to know about because it contains a lot of tremendously useful information, and it is large. Various estimates put the size of the Invisible Web at two to five hundred times the content of the visible Web. Before that number sinks in and alarms you, keep in mind the following:

1. There is a lot of very important material contained in the Invisible Web.
2. For the information that is there that you are likely to have a need for, and the right to access, there are ways of finding out about it and getting to it.
3. In terms of volume, most of the material is material that is meaningless except to those who already know about it, or to the producer’s immediate relatives. Much of the material that can’t be found is probably not worth finding.

To adequately understand what this is all about, one must know why some content is invisible. Note the use of the word “content” instead of the word “sites.” The main page of invisible Web sites is usually easy to find and is covered by search engines. It is the rest of the site (Web pages and other content) that may be invisible. Search engines do not index certain Web content mainly for the following reasons:

1. The search engine does not know about the page. No one has submitted the URL to the search engine and no pages currently covered by the search engine have linked to it. (This falls in the category, “Hardly anyone cares about this page, you probably don’t need to either.”)
2. The search engines have decided not to index the content because it is too deep in the site (and probably less useful), it is a page that changes so frequently that indexing the content would be somewhat meaningless (as, for example, in the case of some news pages), or the page is generated dynamically and likewise is not amenable to indexing. (Think in terms of “Even if you searched and found the page, the content you searched for would probably be gone.”)
3. The search engine is asked not to index the content, by the presence of a robots.txt file on the site that asks engines not to index the site, or specific pages, or particular parts of the site. (A lot of this content could be placed in the “It’s nobody else’s business” category.)
4. The search engine does not have or does not utilize a technology that would be required to index non-HTML content. This applies to files such as images and audio files.
Until 2001, this category included file types such as PDF (Portable Document Format files), Excel files, Word files, and others, that began to be indexed by the major search engines in 2001 and 2002. Because of this increased coverage, the Invisible Web may be shrinking, proportionate to the size of the total Web. 5. The search engine cannot get to the pages to index them because it encounters a request for a password or the site has a search box that must be filled out in order to get to the content. B ASICS FOR THE S ERIOUS S EARCHER 21 It is the last part of the last category that holds the most interest for the searcher—sites that contain their information in databases. Prime examples of such sites would be phone directories, literature databases such as Medline, newspaper sites, and patents databases. As you can see, if you can find out that the site exists, then you (without going through a search engine) can search the site contents. This leads to the obvious question of where one finds out about sites that contain unindexed (Invisible Web) content. The three sites listed below are directories of Invisible Web sites. Keep in mind that they list and describe the overall site, they do not index the contents of the site. Therefore, these directories should be searched or browsed at a broad level. For example, look for “economics” not a particular economic indicator, or for sites on “safety” not “workplace safety.” As you identify sites of interest, bookmark them. You may also want to look at the excellent book on the Invisible Web by Chris Sherman and Gary Price (The Invisible Web: Uncovering Information Sources Search Engines Can’t See. CyberAge Books. Medford, NJ USA. 2001). Direct Search http://www.freepint.com/gary/direct.htm The “grandfather” of Invisible Web directories, this site was created and is main- tained by Gary Price (co-author of The Invisible Web). 
The sites listed here are carefully selected for quality of content, and you can either search or browse.
invisible-web.net
http://www.invisible-web.net
By the authors of The Invisible Web, this is the most selective of the three Invisible Web directories listed here. It contains about 1,000 entries and you can either browse or search.
CompletePlanet
http://completeplanet.com
The site claims “103,000 searchable databases and specialty search engines,” but a significant number of the sites seem to be individual pages (e.g., news articles) and many of the databases are company catalogs, Yahoo! categories, and the like, not necessarily “invisible.” It lists a lot of useful resources, but the content also emphasizes how trivial much Invisible Web material can be.
COPYRIGHT
Because of the seriousness of the implications of this topic, this section could extend for thousands of words. Because this chapter is about basics, though, a few general points will be made and the reader is encouraged to go for more detail to the sources listed next, which are much more authoritative and extensive on the copyright issue. If you are in a large organization, particularly an educational institution, you may want to check your organization’s site for local guidelines regarding copyright.
Copyright: Some Basic Points
Here are some basic points to keep in mind regarding copyright.
1. “Copyright is a form of protection provided by the laws of the United States (title 17, U.S. Code) to the authors of ‘original works of authorship,’ including literary, dramatic, musical, artistic, and certain other intellectual works.” [http://www.copyright.gov/circs/circ1.html#wci]
2. Assume that what you find on a Web site is copyrighted, unless it states otherwise or you know otherwise, for example, based on the age of the item. See the U.S. Copyright Office site below for details as to the time frames for copyrights. (Of considerable use for Web page creators is the fact that “Works by the U.S. Government are not eligible for U.S. copyright protection” [http://www.copyright.gov/circs/circ1.html#wwp]. You should still identify the source when quoting something from the site.)
3. The same basic rules that apply to using other printed material apply to using material you get from the Internet, the most important being: For any work you write for someone else to read, cite the sources you use.
For more information on copyright and the Internet, see the following sources.
United States Copyright Office
http://lcweb.loc.gov/copyright
The official U.S. Copyright Office site, for getting copyright information (for the U.S.) directly from the horse’s mouth. (For other countries, do a search for analogous sites.)
Copyright Web Site
http://www.benedict.com
This site is particularly good for addressing in laypersons’ language the issues involved in the copyright of digital materials. It also provides background and discussion on some well-known legal cases on the topic.
Copyright and the Internet
http://mason.gmu.edu/~montecin/copyright-internet.htm
For someone creating a Web page, this site from George Mason University is an excellent example of a site (written mainly for a particular institution) that provides a realistic, readable set of guidelines regarding copyright and the Internet.
CITING INTERNET RESOURCES
The biggest problem with citing a source you find on the Internet is identifying the author, the publication date, and so forth. In many cases, they just aren’t there or you have to really dig to find them.
Basically, in citing Internet sources, you will just give as much of the typical citation information as you would for a printed source (author, title, publication, date, etc.), add the URL, and include a comment saying something like “Retrieved from the World Wide Web, October 15, 2003” or “Internet, accessed October 15, 2003.” If your reader isn’t particularly picky, just give the information about who wrote it, the title (of the Web page), a date of publication if you can find it, the URL, and when you found it on the Internet. If you are submitting a paper to a journal for publication, to a professor, or including it in a book, be more careful and follow whatever style guide is recommended. Fortunately, many style guides are available online. The following two sites provide links to popular style guides online.
Karla’s Guide to Citation Style Guides
http://bailiwick.lib.uiowa.edu/journalism/cite.html
Karla Tonella provides links to over a dozen online style guides.
Style Sheets for Citing Internet & Electronic Resources
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/Style.html
This site provides a compilation of guidelines based on the following well-known style guides: MLA, Chicago, APA, CBE, and Turabian.
TIP: On virtually every site, look for a site index and a search box. They are often more useful for navigating a site than the graphics and links on its home page.
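
If you keep your sources in a script or spreadsheet export, the informal citation described above (who wrote it, the title, a publication date if found, the URL, and when you retrieved it) can be assembled mechanically. The following is a minimal sketch under stated assumptions: the `WebSource` record and the `informal_citation` function are hypothetical names invented for this illustration, and the punctuation is one arbitrary choice, not the format of MLA, APA, or any other style guide.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Month names spelled out explicitly so the output does not depend on locale.
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


@dataclass
class WebSource:
    """Hypothetical record of what you managed to find out about a Web page."""
    title: str
    url: str
    accessed: date
    author: Optional[str] = None      # often missing on Web pages
    published: Optional[str] = None   # free text, since dates are often partial


def informal_citation(src: WebSource) -> str:
    """Join whatever citation details are known, skipping the missing ones."""
    parts = []
    if src.author:
        parts.append(src.author)
    parts.append(f'"{src.title}"')
    if src.published:
        parts.append(src.published)
    parts.append(src.url)
    when = f"{MONTHS[src.accessed.month - 1]} {src.accessed.day}, {src.accessed.year}"
    parts.append(f"Retrieved from the World Wide Web, {when}")
    return ". ".join(parts) + "."


if __name__ == "__main__":
    print(informal_citation(WebSource(
        title="Karla's Guide to Citation Style Guides",
        url="http://bailiwick.lib.uiowa.edu/journalism/cite.html",
        accessed=date(2003, 10, 15),
        author="Karla Tonella",
    )))
```

Missing details are simply left out rather than guessed, which mirrors the advice to give as much of the typical citation information as you can find. For anything submitted for publication, follow the recommended style guide instead.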
