From Brazil!

Hi. Some rather useful information, especially for those who need highly specific internet searches.

Best regards,
g.

Guilherme R Basilio
Rio de Janeiro, Brazil
http://www.transconsult.com.br
Tel: +55 21 3387 1819

------------------------------

<http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html>

Invisible Web: What it is, Why it exists, How to find it, and Its inherent ambiguity
UC Berkeley - Teaching Library Internet Workshops

What is the "Invisible Web"?

The "visible web" is what you see in the results pages from general web search engines. It's also what you see in almost all subject directories. The "invisible web" is what you cannot retrieve ("see") in the search results and other links contained in these types of tools.

* Searchable Databases. Most of the invisible web is made up of the contents of thousands of specialized searchable databases that you can search via the Web. The search results from many of these databases are delivered to you in web pages that are generated just in answer to your search. Such pages very often are not stored anywhere: it is easier and cheaper to dynamically generate the answer page for each query than to store all the possible pages containing all the possible answers to all the possible queries people could make to the database. Search engines cannot find or create these pages.

* Excluded Pages. There are some types of pages that search engine companies exclude by policy. There is no technical reason they could not include them if they wanted. It's a matter of selecting what to include and what to leave out of databases that are already huge, expensive to operate, and whose search function is a low revenue producer.

How to Find the Invisible Web

Simply think "databases" and keep your eyes open. You can find searchable databases and other invisible web stuff in the course of routine searching in most general web directories.
Of particular value in academic research are:

* Librarians' Index
* AcademicInfo
* Infomine

Use Google and other search engines to locate searchable databases by searching a subject term and the word "database". If the database uses the word "database" in its own pages, you are likely to find it in Google. The word "database" is also useful in searching a topic in the Yahoo! directory, because Yahoo! sometimes uses the term to describe searchable databases in its listings.

EXAMPLES for Google & Yahoo!:

  plane crash database
  languages database
  toxic chemicals database

Remember that the Invisible Web exists. Remember that, in addition to what you find in search engine results and most directories, there are these gold mines you have to search directly. As part of your wise web search strategy, spend a little time looking for databases in your field or topic of study or research. The table below may help if you cannot find anything using the suggestions above.

Selected Directories of Searchable Databases: Table of Features

The suggestions listed above are probably your best bet for locating invisible web pages. There used to be a series of directories that purported to lead you to the invisible web, but most of them have deteriorated into a mixture of visible and invisible sites. This is partly because the phrase "invisible web" is less trendy and less apt to attract users to a site. Two useful sites for exploring what the invisible web has to offer are listed here:

* The Invisible Web Directory <http://www.invisible-web.net>
  Directory of searchable databases. Use by browsing subjects; not searchable. "This site is a companion to The Invisible Web: Finding Hidden Internet Resources Search Engines Can't See, by Chris Sherman and Gary Price."

* Direct Search <http://www.freepint.com>
  Several long pages listing and describing searchable databases on many academic topics. Pick the section or page from the links near the top.
  If you search, keep searches simple, because the search tool is not very good. Done by Gary Price, an academic librarian with research experience. HARD TO USE. Except in Business and Economics, I would use the other directories above.

Why Are Some Pages Invisible?

There are two reasons a search engine does not contain a page: (I) technical barriers that prohibit access and (II) choices or decisions to exclude.

I. Technical barriers: TYPING and/or JUDGMENT are required.

If the only way to access web pages requires you to type something, or to scan a page and select a combination of options, search engines are unable to proceed.

WHY? Search engine databases are created by computer robot programs called spiders, which crawl the web seeking search engine content. These spiders crawl or navigate the Web by following the links in the web pages that are already in the database of their parent search engine. If there is no link to a page, a spider cannot "see" it. Spiders lack the ability to type or think of any string of characters. They also cannot scan a set of options and choose which one to select. They not only lack fingers for typing, but also lack a brain capable of judgment.

Pages created as the result of a search are called "dynamically generated" pages. The answer to your query is encased in a web page designed to carry the answer and sent to your computer. Often the page is not stored anywhere afterward, because its unique content (the answer to your specific query) is probably not of use to many other people. It's easier for the database to regenerate the page when needed than to keep it around.

The opposite of a "dynamic" page is a "static" page. Static pages reside on servers, each identified by a unique URL, waiting to be retrieved when their URL is invoked. Spiders can find a static page if it is linked to in any other page they "know" about. They follow links to it and retrieve it much as you would by clicking if you knew the link.
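
The link-following behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any real search engine spider is written: the page names and the `SITE_LINKS` table are hypothetical, and the "web" here is just an in-memory dictionary. It shows that a static page nothing links to is never discovered.

```python
from collections import deque

# Hypothetical site: each static page maps to the links it contains.
SITE_LINKS = {
    "/index.html": ["/about.html", "/catalog.html"],
    "/about.html": ["/index.html"],
    "/catalog.html": ["/item1.html"],
    "/item1.html": [],
    "/orphan.html": [],  # a static page, but no other page links to it
}


def crawl(seed_pages, links):
    """Breadth-first walk that can only reach pages by following links."""
    seen = set(seed_pages)
    queue = deque(seed_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# The spider starts from the one page it already "knows" about.
found = crawl(["/index.html"], SITE_LINKS)
invisible = sorted(set(SITE_LINKS) - found)

print(sorted(found))
print(invisible)
```

In this toy site the spider reaches every page that is linked from somewhere, while `/orphan.html` stays out of its reach even though it is an ordinary static page.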
Static pages are not invisible, although search engines might choose to omit them for policy reasons discussed below.

The content of a lot of web pages is both searchable and browsable by clicking on links. To the extent that the content found by searching is replicated in web pages with links somewhere, part (or all) of the content might be found in a general Web search engine (unless the pages are excluded from a search engine for policy reasons, discussed below). Any content not contained in static pages linked to somewhere else remains invisible. You have to search a database directly to find it.

The inability of spiders to type and think creates two types of Invisible Web pages:

Category 1: Content from searchable specialized databases can be entirely or partially invisible or visible, depending on how much is contained in static pages with links.

EXAMPLES of sites with searchable databases include most search engines, like Google, Northern Light, or AltaVista. The contents of all online library catalogs that do not require a password (like UCB's Pathfinder) are also invisible web. The results of your searches are dynamically generated. It is sometimes possible to retain that humongous URL on top of your search result, and use it to regenerate the page dynamically when you click on it. But the results pages are not stored anywhere. (Why don't search engines contain the links to results in other search engines? See the section below on links search engine spiders won't touch.)

An EXAMPLE of a site in this course with contents that are accessible both by searching directly and by links accessible to search engine spiders is Yahoo!, along with many other directories that are organized as searchable but also provide access to their contents by browsing (following links). Spiders have to approach the contents following the slow browse/link approach, whereas you can type searches or browse. If a search engine wanted to, it could get to all of the information in Yahoo!
that is accessible by following links. This is true for many such sites, but see the section below about links search engine spiders won't touch.

Category 2: Password or login required.

All sites requiring a password or login are closed to search engine spiders because they require typing something spiders cannot "know." The contents of these sites are very unlikely to be in any general Web search engine. This includes all of the passworded resources that exist (at UCB, we have hundreds of indexing services, encyclopedias, directories, and other web-based resources that you need some kind of password to access; there are thousands more web sites where all or part of the site requires a password because the site is not free or is restricted for other reasons).

II. Pages search engines choose to exclude: FORMAT of the page.

Search engines may choose not to include pages because the format of the document would be infrequently or unsuccessfully searched by the users of the search engine. There is no technical reason they must exclude them; it is only a policy made by most search engine companies.

WHY? Search engine databases and spiders are optimized to "read" HTML, the basic language of the Web. Other document formats contain codes and formatting requirements that are incompatible with HTML. HTML can carry links to these pages, but not the full text of their content in their special format. Pages with images and no text are also often omitted because, without text, there is nothing for you to do a keyword search on to find the image; so why bother to include it?

Category 3: Pages formatted in PDF and other pages written using very little if any HTML text.

Search engines also have a hard time indexing the contents of documents in Flash, Shockwave, and other programs like Word, WordPerfect, PowerPoint, etc. Pages consisting almost entirely of images are often excluded as well.
EXCEPTIONS:

* Google now provides the ability to search the full text of many PDF files by converting these files to text, and encasing the text in HTML so it can work like an ordinary web page in the Google database. Your search is matched against the text "translation," and the results show a link to the original complete PDF document. Other search engines currently do not provide this service. (Try it by searching "form 1040" in Google. Click on the little "Text version" and "PDF" links.)

* The image databases that Google, AltaVista, and other search engine companies offer are structured to handle these types of files with less text.

SCRIPT-BASED pages: Links containing a ?.

A script is a type of programming language that can be used to fetch and display web pages. There are many kinds and uses of scripts on the Web. They can be used to create all or part of a web page, and to communicate with searchable databases. Many of the database queries and responses discussed in Category 1 use scripts. When you find a question mark (?) in the URL of a page, some kind of script command is used in that page. Most search engines are instructed not to crawl sites or include pages that use script technology, although it is often technically possible for them to do so. This is another policy decision.

WHY? If spiders encounter a ? in a URL or link, they are programmed to back off. They could encounter poorly written scripts or intentional "spider traps" designed to ensnare spiders, sometimes bogging them down in infinite loops that run up the cost and time it takes for spiders to do their work. So search engine companies instruct their spiders not to retrieve (i.e., put in the search engine) pages with URLs containing ?. This may result in the contents of an entire site using scripts being excluded from a search engine, or a search engine may crawl the safe parts of a site and omit others. A spider doesn't have the freedom and creativity that you have to jump around a site intelligently.
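
The "back off at a ?" policy can be pictured as a very small filter. The Python sketch below is only an illustration of the rule as described here, not code from any actual search engine; the function name `spider_should_fetch` and the example.org URLs are made up for the example.

```python
def spider_should_fetch(url):
    """Apply the policy described above: skip any URL that contains a '?'.

    Anything after '#' is only an in-page anchor, so it is ignored here
    (an assumption of this sketch, not something the guide specifies).
    """
    url_without_anchor = url.split("#", 1)[0]
    return "?" not in url_without_anchor


# Hypothetical links a spider might find on a page.
candidate_links = [
    "http://example.org/browse/automobiles",
    "http://example.org/search?title=Motorcycles",
    "http://example.org/guide.html#faq",
]

# Only the links without a '?' are kept for crawling.
to_crawl = [url for url in candidate_links if spider_should_fetch(url)]
print(to_crawl)
```

The rule is deliberately blunt: it never looks at what the script behind the ? would return, which is why perfectly harmless pages end up excluded along with the spider traps.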
Category 4: Script-based pages, bearing a tell-tale ? in their URL.

EXAMPLES of databases whose contents are entirely script-generated: Google. There are no static URLs on these sites for the kinds of things you can access by searching, and, if there were, search engine spiders would choose not to index them. They are doubly invisible (once because they fall in Category 1, and once because they are excluded by policy).

An EXAMPLE of a site partially using scripts is the Librarians' Index. Some of the links in the browsable directory that starts on the home page are script-based (containing ?), and some are not. Google and some other search engines contain the pages without any ?, but not any that do contain a ?.

The LII page for Automobiles <http://lii.org/search/file/automobiles> is in Google; the LII page for Motorcycles <http://lii.org/search?title=Motorcycles;query=Motorcycles;searchtype=subject> is not. Note the question mark. Google's spider is technically capable of retrieving both pages by following their links, just as you can by clicking on them. But, because of the ?, it omits the "Motorcycles" page.

In the Yahoo! directory, when you click on links (the way a spider would have to), there are no ? in the resulting URLs. But if you search Yahoo!, the URLs all contain ?, indicating scripts. Guess which URLs you will find in a search engine?

The Ambiguity Inherent in the Invisible Web:

It is very difficult to predict what sites or kinds of sites or portions of sites will or won't be part of the Invisible Web. There are several factors involved:

o Which sites replicate some of their content in static pages (hybrid of visible and invisible in some combination)?
o Which replicate it all (visible in search engines if you construct a search matching the page)?
o Which replicate none and must be searched directly (totally invisible)?
o You often don't know if a page has a ? in its URL until after you've somehow found it (excluded by policy).
o Search engines can change their policies on what they exclude and include.

Want to learn more about the Invisible Web?

* A recent book: Gary Price & Chris Sherman. The Invisible Web: Uncovering Information Sources Search Engines Can't See. CyberAge Books, July 2001. ISBN 091096551X (Paper $29.95).
  o The companion directory to the book is The Invisible Web Directory (see above).
  o Excerpts from chapters 4 and 6 of this book have been adapted or reprinted in: Gary Price & Chris Sherman. "Premier(e) Books: The Invisible Web," SEARCHER [magazine], vol. 9, no. 6, June 2001. Pages 62-74.
  o NOTE: I disagree with some of the overly confusing explanations in this article: I think the authors make the Invisible Web more complicated than it is.

* A smart discussion may be found at: Robert J. Lackie, Those Dark Hiding Places: The "Invisible Web" Revealed <http://library.rider.edu/scholarly/rlackie/Invisible/Inv_Web.html>

* Other links of possible interest on the Invisible Web are available under this topic in About.com.

* SearchAbility. Descriptions of many directories and lists of searchable databases, extensively annotated, rated, and described. Excellent background on specialized searchable databases on the web.

_______________________________________________
Mt-list mailing list