Search engines make the Internet navigable in a meaningful way; without them, finding information would waste a great deal of time.
Because they are essential tools for anyone browsing the web, developers are constantly working to improve them.
Norconex HTTP Collector
is one such tool: it can be employed to crawl sites quickly and either save the results to a local folder or feed them directly to a search engine.
The application supports multi-threaded crawling, so results are gathered with little wasted time. This ability is especially useful when dealing with particularly large websites.
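As a minimal sketch of how such a crawl might be configured, the following illustrates a Norconex HTTP Collector XML configuration with a thread count set for parallel crawling. Element names here are based on the version 2 configuration format, and the start URL is a placeholder; verify both against the official documentation before use.

```xml
<!-- Illustrative sketch only: element names assume the v2 config format. -->
<httpcollector id="Example Collector">
  <crawlers>
    <crawler id="Example Crawler">
      <!-- Placeholder start URL -->
      <startURLs>
        <url>http://example.com</url>
      </startURLs>
      <!-- Number of crawling threads to run in parallel -->
      <numThreads>4</numThreads>
      <!-- How many link levels deep to follow from the start URL -->
      <maxDepth>2</maxDepth>
    </crawler>
  </crawlers>
</httpcollector>
```

A configuration file like this is typically passed to the collector's launch script when starting a job.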
Once a target has been specified, the program automatically attempts to detect the document language, and because the library supports OCR, text can be extracted from embedded images and PDFs.
Other formats, such as HTML and Office documents, are also supported, and the spider can process canonical URLs.
Several settings can be customized when starting jobs: the crawling speed can be adjusted, embedded documents can be treated as distinct files, and hierarchical fields can be built.
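Crawling speed, for instance, is typically controlled through a delay between requests. The fragment below is a hedged sketch of such a setting, again assuming the v2 XML configuration format (the element name and millisecond unit should be confirmed against the current manual):

```xml
<!-- Sketch: throttle the crawler by waiting between page requests.
     Assumes the v2 <delay> element, with the value in milliseconds. -->
<delay default="3000" />
```

A higher delay is gentler on the target server at the cost of a slower crawl; a lower one speeds things up when the site can handle the load.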
Output documents can be filtered based on their URL, their HTTP headers, or their metadata.
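URL-based filtering, for example, is commonly expressed as a regular-expression filter inside the crawler configuration. The sketch below assumes a v2-style reference filter class and the `onMatch` attribute; both the class name and the attribute should be checked against the project's documentation, as they are reproduced here from memory:

```xml
<!-- Sketch: exclude stylesheet URLs from the crawl.
     The filter class name is an assumption based on the v2 API. -->
<referenceFilters>
  <filter class="com.norconex.collector.core.filter.impl.RegexReferenceFilter"
          onMatch="exclude">.*\.css$</filter>
</referenceFilters>
```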
For ease of use, several sample configurations are available, allowing developers and users to evaluate the tool's capabilities quickly.
A concise online manual covers many common issues, and the project forums can be consulted to help obtain good results.