Invisible Web  

From Brazil!
Hi.

Some rather useful information, especially for those
who need highly specific internet searches.

Regards,
g.

Guilherme R Basilio
Rio de Janeiro, Brazil

http://www.transconsult.com.br

Tel: +55 21 3387 1819


------------------------------
<http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html>

Invisible Web: What it is, Why it exists, How to
find it, and Its inherent ambiguity

UC Berkeley - Teaching Library Internet Workshops


What is the "Invisible Web"?

The "visible web" is what you see in the results
pages from general web search engines. It's also
what you see in almost all subject directories.

The "invisible web" is what you cannot retrieve
("see") in the search results and other links
contained in these types of tools.

* Searchable Databases. Most of the invisible web
is made up of the contents of thousands of
specialized searchable databases that you can
search via the Web. The search results from many
of these databases are delivered to you in web
pages that are generated just in answer to your
search. Such pages very often are not stored
anywhere: it is easier and cheaper to dynamically
generate the answer page for each query than to
store all the possible pages containing all the
possible answers to all the possible queries
people could make to the database. Search engines
cannot find or create these pages.

* Excluded Pages. There are some types of pages
that search engine companies exclude by policy.
There is no technical reason they could not
include them if they wanted. It's a matter of
selecting what and what not to include in
databases that are already huge, expensive to
operate, and whose search function is a low
revenue producer.


How to Find the Invisible Web

Simply think "databases" and keep your eyes open.
You can find searchable databases and other
invisible web stuff in the course of routine
searching in most general web directories. Of
particular value in academic research are

* Librarians Index
* AcademicInfo
* Infomine

Use Google and other search engines to locate
searchable databases by searching a subject term
and the word "database". If the database uses the
word database in its own pages, you are likely to
find it in Google. The word "database" is also
useful in searching a topic in the Yahoo!
directory, because Yahoo! sometimes uses the term
to describe searchable databases in its listings.

EXAMPLES for Google & Yahoo!:
plane crash database
languages database
toxic chemicals database
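
The tip above can be sketched in a few lines of Python. This is only an illustration: the helper name database_queries and the URL parameter name "q" are assumptions made up here, and the code only builds query strings; it does not contact any search engine.

```python
from urllib.parse import urlencode


def database_queries(subjects):
    """Append the word "database" to each subject term."""
    return [f"{subject.strip()} database" for subject in subjects]


subjects = ["plane crash", "languages", "toxic chemicals"]
for query in database_queries(subjects):
    # urlencode shows how the query might look as a URL parameter;
    # the parameter name "q" is an assumption, not taken from the guide.
    print(query, "->", urlencode({"q": query}))
```

Notice that the encoded form already contains the kind of parameter string that follows a ? in a search URL.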

Remember that the Invisible Web exists. Remember
that, in addition to what you find in search
engine results and most directories, there are
these gold mines you have to search directly.

As part of your wise web search strategy, spend a
little time looking for databases in your field
or topic of study or research. The table below
may help if you cannot find anything using the
suggestions above.

Selected Directories of Searchable Databases: Table of Features

The suggestions listed above are probably your
best bet for locating invisible web pages. There
used to be a series of directories that purported
to lead you to the invisible web, but most of
them have deteriorated into a mixture of visible
and invisible sites. This is partly because the
phrase "invisible web" has become less trendy
and less apt to attract users to a site.


Two useful sites for exploring what the invisible
web has to offer are listed here:

* The Invisible Web Directory <http://www.invisible-web.net>

Directory of searchable databases. Use by browsing subjects; not searchable.

"This site is a companion to The Invisible Web:
Finding Hidden Internet Resources Search Engines
Can't See, by Chris Sherman and Gary Price."

* Direct Search <http://www.freepint.com>

Several long pages listing and describing
searchable databases on many academic topics.
Pick the section or page from the links near the
top. If you search, keep searches simple, because
the search tool is not very good. Done by Gary
Price, an academic librarian with research
experience. HARD TO USE. Except in Business and
Economics, I would use the other directories above.



Why Are Some Pages Invisible?

There are two reasons a search engine does not
contain a page: (I) technical barriers that
prohibit access and (II) choices or decisions to exclude.

I. Technical barriers:

TYPING and/or JUDGMENT are required. If the only
way to access web pages requires you to type
something or scan a page and select a combination
of options, search engines are unable to proceed.

WHY? Search engine databases are built by robot
programs called spiders, which crawl the Web
seeking content for the search engine.
These spiders crawl or navigate the Web by
following the links in the web pages that are
already in the database of their parent search
engine. If there is no link to a page, a spider
cannot "see" it. Spiders cannot type or think up
a string of characters. They also
cannot scan a set of options and choose which one
to select. They lack not only fingers for typing,
but also a brain capable of judgment.
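
The link-following behavior described above can be sketched as a tiny crawler. This is a simplified illustration under stated assumptions, not how any real search engine spider is written: the page names and the LINKS table are hypothetical, and the "web" is just a dictionary in memory.

```python
from collections import deque

# Hypothetical miniature web: each page maps to the pages it links to.
LINKS = {
    "home": ["about", "catalog"],
    "about": ["home"],
    "catalog": ["item-1"],
    "item-1": [],
    "orphan": ["home"],  # links out, but no page links to it
}


def crawl(start, links):
    """Return every page reachable from start by following links."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


found = crawl("home", LINKS)
print(sorted(found))
print("orphan" in found)
```

The page called "orphan" exists, but because nothing links to it the crawl never reaches it. The spider has no way to guess or type its address.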

Pages created as the result of a search are
called "dynamically generated" pages. The answer
to your query is encased in a web page designed
to carry the answer and sent to your computer.
Often the page is not stored anywhere afterward,
because its unique content (the answer to your
specific query) is probably not of use to many
other people. It's easier for the database to
regenerate the page when needed than to keep it around.
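
A rough sketch of what "dynamically generated" means, assuming a made-up three-record database (RECORDS) and a hypothetical function name (answer_page). Nothing is saved between calls; the page is rebuilt from the query every time it is requested.

```python
import html

# Hypothetical database: a few records keyed by title.
RECORDS = {
    "Motorcycles": "Sites about motorcycles.",
    "Automobiles": "Sites about automobiles.",
    "Airplanes": "Sites about airplanes.",
}


def answer_page(query):
    """Build an HTML results page for one query, on demand."""
    hits = [title for title in RECORDS if query.lower() in title.lower()]
    items = "".join(
        f"<li>{html.escape(title)}: {html.escape(RECORDS[title])}</li>"
        for title in hits
    )
    heading = f"<h1>Results for {html.escape(query)}</h1>"
    return f"<html><body>{heading}<ul>{items}</ul></body></html>"


page = answer_page("motor")
print(page)
```

Storing a page for every possible query string would take far more space than the three records themselves, which is why regenerating on demand is the cheaper choice.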

The opposite of a "dynamic" page is a "static"
page. Static pages reside on servers, each
identified by a unique URL, and waiting to be
retrieved when their URL is invoked. Spiders can
find a static page if it is linked to in any
other page they "know" about. They follow links
to it and retrieve it much as you would by
clicking if you knew the link. Static pages are
not invisible, although search engines might
choose to omit them for policy reasons discussed below.

The content of a lot of web pages is both searchable
and browsable by clicking on links. To the extent
that the content found by searching is replicated
in web pages with links somewhere, part (or all)
of the content might be found in a general Web
search engine (unless the pages are excluded from
a search engine for policy reasons, discussed
below). Any content not contained in static pages
linked to somewhere else remains invisible. You
have to search a database directly to find these.
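
One way to picture this is as a set difference. The record IDs below are hypothetical; the point is only that whatever is in the database but not in linked static pages is what stays invisible.

```python
# Hypothetical record IDs held in a searchable database.
database_records = {"rec-1", "rec-2", "rec-3", "rec-4"}

# Hypothetical subset that also appears in static pages
# a spider can reach by following links.
in_linked_static_pages = {"rec-1", "rec-3"}

# Visible: in the database AND replicated in linked static pages.
visible = database_records & in_linked_static_pages

# Invisible: in the database only; you must search it directly.
invisible = database_records - in_linked_static_pages

print("visible:", sorted(visible))
print("invisible:", sorted(invisible))
```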

The inability of spiders to type and think
causes two types of Invisible Web pages:

Category 1: Content from searchable specialized
databases can be entirely or partially invisible
or visible, depending on how much is contained in static pages with links.

EXAMPLES of sites with searchable databases
include most search engines like Google or
Northern Light or AltaVista. The contents of all
online library catalogs that do not require a
password (like UCB's Pathfinder) are also
invisible web. The results of your searches are
dynamically generated. It is sometimes possible
to retain that humongous URL on top of your
search result, and use it to regenerate the page
dynamically when you click on it. But the results
pages are not stored anywhere. (Why don't search
engines contain the links to results in other
search engines? See the section below on links
search engine spiders won't touch.)

An EXAMPLE of a site in this course whose contents
are accessible both by searching directly and by
links that search engine spiders can follow is
Yahoo!, along with many other directories that are
searchable but also provide access to their
contents by browsing (following links). Spiders
have to approach the contents by the slow
browse/link route, whereas you can type
searches or browse. If a search engine wanted to,
it could get to all of the information in Yahoo!
that is accessible by following links. This is
true for many such sites, but see the section
below about links search engine spiders won't touch.

Category 2: Password or login required. All sites
requiring a password or login are closed to
search engine spiders because they require typing
something spiders cannot "know." The contents of
these sites are very unlikely to be in any
general Web search engine. This includes all of
the passworded resources that exist (at UCB, we
have hundreds of indexing services,
encyclopedias, directories, and other web-based
resources that you need some kind of password to
access; there are thousands more web sites where
all or part of the site requires a password
because the site is not free or is restricted for other reasons).


II. Pages search engines choose to exclude:

FORMAT of the page. Search engines may choose not
to include pages because the format of the
document would be infrequently or unsuccessfully
searched by the users of the search engine. There
is no technical reason they must exclude them --
only a policy made by most search engine companies.

WHY? Search engine databases and spiders are
optimized to "read" HTML, the basic language of
the Web. Other document formats
contain codes and format requirements
that are incompatible with HTML. HTML can carry
links to these documents, but not the full text of
their content in their special format. Pages with
images and no text are also often omitted
because, without text, there is nothing for you
to do a keyword search on to find the image; so why bother to include it?

Category 3: Pages formatted in PDF and other
pages written using very little if any HTML text.
Search engines also have a hard time indexing the
contents of documents in Flash, Shockwave, and
other programs like Word, WordPerfect, and PowerPoint.
Pages consisting almost entirely of images are often excluded as well.

EXCEPTIONS:

* Google now provides the ability to search the
full text of many PDF files by converting these
files to text, and encasing the text in HTML so
it can work like an ordinary web page in the
Google database. Your search is matched against the
text "translation," and the results show a link to
the original complete PDF document. Other search
engines currently do not provide this service.
(Try it by searching "form 1040" in Google. Click on
the little "Text version" and "PDF" links.)

* The image databases that Google, AltaVista, and
other search engine companies offer are
structured to handle these types of files with less text.
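
The "encasing the text in HTML" step in the first exception might look something like the sketch below. This is a guess at the general idea, not Google's actual method: the function name text_version, the sample text, and the example.org URL are all hypothetical, and the text is assumed to have been extracted from the PDF already.

```python
import html


def text_version(extracted_text, original_url):
    """Wrap already-extracted text in a minimal HTML page
    that links back to the original document."""
    paragraphs = [p.strip() for p in extracted_text.split("\n\n") if p.strip()]
    body = "\n".join(f"<p>{html.escape(p)}</p>" for p in paragraphs)
    link = html.escape(original_url, quote=True)
    return (
        "<html><body>\n"
        f"{body}\n"
        f'<p><a href="{link}">Original PDF</a></p>\n'
        "</body></html>"
    )


# Hypothetical extracted text; characters like < and & must be escaped.
sample = "Form 1040\n\nIncome < 100 & other text"
converted = text_version(sample, "http://example.org/f1040.pdf")
print(converted)
```

Once the text sits inside ordinary HTML, it can be indexed and keyword-searched like any other web page, while the link still leads to the original file.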

SCRIPT-BASED pages: Links containing a ?. A
script is a small program that can
be used to fetch and display web pages. There are
many kinds and uses of scripts on the Web. They
can be used to create all or part of a web page,
and to communicate with searchable databases.
Many of the database queries and responses
discussed in Category 1 use scripts. When you
find a question mark (?) in the URL of a page, some
kind of script command is used in that page. Most
search engines are instructed not to crawl sites
or include pages that use script technology,
although it is often technically possible for
them to do so. This is another policy decision.

WHY? If spiders encounter a ? in a URL or link,
they are programmed to back off. They could
encounter poorly written scripts or intentional
"spider traps" designed to ensnare spiders,
sometimes bogging them down in infinite loops
that run up the cost and time it takes for
spiders to do their work. So search engine
companies instruct their spiders not to retrieve
(i.e., put in the search engine) pages with URLs
containing ?. This may result in the contents of
an entire site using scripts being excluded from
a search engine, or a search engine may crawl
the safe parts of a site and omit others. A spider
doesn't have the freedom and creativity that you
have to jump around a site intelligently.

Category 4: Script-based pages, bearing a tell-tale ? in their URL.

EXAMPLES of databases whose contents are entirely
script-generated Google, . There are no static
URLs on these sites for the kinds of things you
can access by searching, and, if there were,
search engine spiders would choose not to index
them. They are doubly invisible (once because
they fall in Category 1, and once because they are excluded by policy).

An EXAMPLE of a site partially using scripts is
the Librarians' Index. Some of the links in the
browsable directory that starts on the home
page are script-based (containing ?), and some
are not. Google and some other search engines
contain the pages without any ?, but not any that do contain a ?.

The LII page for Automobile
<http://lii.org/search/file/automobiles> is in Google;

the LII page for Motorcycles
<http://lii.org/search?title=Motorcycles;query=Motorcycles;searchtype=subject>
is not. Note the question mark.

Google's spider is technically capable of
retrieving both pages by following their links,
just as you can by clicking on them. But, because
of the ?, it omits the "Motorcycles" page.
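
The policy just described comes down to a one-line test. The sketch below is an assumption about the general rule, not any search engine's real code: the function name spider_would_fetch is hypothetical, and the second LII URL is joined into a single string here, which is an assumption about its exact form.

```python
from urllib.parse import urlsplit


def spider_would_fetch(url):
    """Apply the policy described above: skip any URL containing a ?."""
    return "?" not in url


urls = [
    "http://lii.org/search/file/automobiles",
    "http://lii.org/search?title=Motorcycles;query=Motorcycles;searchtype=subject",
]

for url in urls:
    parts = urlsplit(url)
    # parts.query holds whatever follows the ? (empty if there is none).
    print(spider_would_fetch(url), parts.path, parts.query)
```

The check never looks at what the script would return. It rejects the link on the ? alone, which is why perfectly harmless pages get left out too.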

In the Yahoo! directory, when you click on links (the
way a spider would have to) there are no ? in the
resulting URLs. But if you search Yahoo!, the
URLs all contain ?, indicating scripts. Guess
which URLs you will find in a search engine?



The Ambiguity Inherent in the Invisible Web:

It is very difficult to predict what sites or
kinds of sites or portions of sites will or won't
be part of the Invisible Web. There are several factors involved:

o Which sites replicate some of their content in
static pages (hybrid of visible and invisible in
some combination)?

o Which replicate it all (visible in search
engines if you construct a search matching the page)?

o Which replicate none and must be searched
directly (totally invisible)?

o You often don't know if a page has a ? in its
URL until after you've somehow found it (excluded by policy).

o Search engines can change their policies on
what they exclude and include.



Want to learn more about the Invisible Web?

* A recent book: Gary Price & Chris Sherman. The
Invisible Web : Uncovering Information Sources
Search Engines Can't See. CyberAge Books, July
2001. ISBN 091096551X (Paper $29.95).

o The companion directory to the book is The
Invisible Web Directory (see above).

o Excerpts from chapters 4 and 6 of this book
have been adapted or reprinted in: Gary Price &
Chris Sherman. "Premier(e) Books: The Invisible
Web," SEARCHER [magazine], vol. 9, no. 6, June 2001. Pages 62-74.

o NOTE: I disagree with some of the overly
confusing explanations in this article: I think
the authors make the Invisible Web more complicated than it is.

* A smart discussion may be found at: Robert J.
Lackie, Those Dark Hiding Places: The "Invisible
Web" Revealed
<http://library.rider.edu/scholarly/rlackie/Invisible/Inv_Web.html>

* Other links of possible interest on the
Invisible Web are available under this topic in About.com.

* SearchAbility. Descriptions of many directories
and lists of searchable databases, extensively
annotated, rated, and described. Excellent
background on specialized searchable databases on the web.






_______________________________________________
Mt-list mailing list

