The terms Deep Web, Hidden Web, Invisible Web, and Deep Net describe the portion of the World Wide Web that is not visible to the public or has not been indexed by search engines. Some of the deep web consists of dynamic pages that can be reached only through a form or a submitted query. Web pages that no other page links to are also part of the deep web: they are, in effect, invisible, because search engine crawlers cannot find pages that have no inbound links (backlinks).
Sites that require registration before granting access can also be considered part of the deep net; such sites often block search engine spiders from browsing and indexing their pages through conventions such as the Robots Exclusion Standard (robots.txt). Furthermore, pages built with Flash or JavaScript, other scripted content, and non-text or non-HTML file formats such as the PDF and DOC documents found in Usenet archives are indexed by only some search engines. This makes them part of the Hidden Web as well.
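As an illustration of the Robots Exclusion Standard, the short Python sketch below uses the standard library's urllib.robotparser to check whether a crawler may fetch a page. The site and paths are hypothetical, and the robots.txt contents shown in the comment are assumed.

    from urllib.robotparser import RobotFileParser

    # Hypothetical site whose robots.txt is assumed to contain:
    #   User-agent: *
    #   Disallow: /members/
    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse the robots.txt file

    # A well-behaved crawler asks before fetching; a disallowed path is
    # never crawled and therefore never indexed.
    print(rp.can_fetch("*", "https://example.com/members/profile"))  # expected: False
    print(rp.can_fetch("*", "https://example.com/about"))            # expected: True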
Crawler Limitations
A search engine’s web crawler follows hyperlinks to discover and index content on the Web. This tactic is ineffective for deep web resources. For instance, crawlers do not attempt to retrieve the dynamic pages that result from database queries, because the number of possible queries, and therefore of possible result pages, is effectively unbounded.
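To make the limitation concrete, here is a minimal sketch of a link-following crawler in Python (standard library only; the seed URL is a placeholder). It discovers pages only through <a href> links, so content reachable solely by submitting a form or query never enters its queue.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href values of <a> tags; forms and scripts are ignored."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, limit=50):
        """Breadth-first crawl that only follows hyperlinks.

        Pages generated by form submissions or database queries have no
        inbound links, so this loop can never reach them.
        """
        seen, queue = set(), deque([start_url])
        while queue and len(seen) < limit:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue
            parser = LinkExtractor()
            parser.feed(html)
            for href in parser.links:
                queue.append(urljoin(url, href))
        return seen

    if __name__ == "__main__":
        # Placeholder seed page; replace with a real start URL.
        print(crawl("https://example.com/"))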
New Innovations
These limitations are, however, being overcome by newer search engine crawlers (such as Pipl) that are designed to identify, interact with, and retrieve information from deep web resources and searchable databases. Google's Sitemap Protocol and the mod_oai module, for example, were developed to improve deep web coverage of web servers: they let a web server announce to search engines the URLs that are accessible on it.
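The Sitemap Protocol itself is a small XML format. The Python sketch below (standard library only, with made-up URLs) generates a minimal sitemap.xml that a web server could publish so that crawlers learn about pages with no inbound links.

    import xml.etree.ElementTree as ET

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    # Hypothetical URLs the server wants crawlers to know about, including
    # pages that no other page links to.
    pages = [
        ("https://example.com/reports/archive", "2009-11-30"),
        ("https://example.com/db/record?id=42", "2009-12-01"),
    ]

    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod

    # Writes a file that can be referenced from robots.txt via a Sitemap: line.
    ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)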
Another solution being pursued by search engines such as Alacra, Northern Light, and CloserLookSearch is the specialty search engine, which focuses on a particular topic or subject area. Narrowing the scope in this way allows the engine to search the deep web in more depth, querying password-protected and dynamic databases within its niche.
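As a rough illustration of how such a crawler might query a dynamic database rather than follow links, the Python sketch below submits a search query programmatically; the endpoint, parameters, and user agent string are entirely hypothetical.

    from urllib.parse import urlencode
    from urllib.request import Request, urlopen

    # Hypothetical searchable database that sits behind a query form.
    SEARCH_ENDPOINT = "https://deepdb.example.com/search"

    def query_database(term, page=1):
        """Retrieve one page of results by submitting the query a human
        would otherwise type into the site's search form."""
        params = urlencode({"q": term, "page": page})
        request = Request(
            f"{SEARCH_ENDPOINT}?{params}",
            headers={"User-Agent": "specialty-crawler-sketch/0.1"},
        )
        with urlopen(request, timeout=10) as response:
            return response.read().decode("utf-8", "replace")

    # Each distinct query can surface records that have no URL linked
    # from anywhere on the surface web.
    results_html = query_database("market research")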
Deep Web or Surface Web
A challenge that researchers in this field face is the classification of resources, because the boundary between the surface web and the deep web is a gray area. Some sites appear in search engine indexes yet were discovered not by conventional web crawlers but through tools such as OAIster, mod_oai, or the Sitemap Protocol. Others sit on the surface web, reachable through ordinary hyperlinks, but have simply not been crawled yet.
The research being done in this area of computer science will give Internet users greater access to deep web data as well as more meaningful search results. Researchers are also looking for ways to classify and categorize those results by topic and according to users' needs.