The Apache Software Foundation is friggin awesome. They come up with some really great things; more to the point, they maintain an excellent list of open source projects that have had a huge impact everywhere. Apache Web Server is one of the most widely used servers out there, coughing up HTML/CSS/PHP to the worldwide masses.

Apache Nutch is an open source web crawler that builds on Apache Solr and Lucene technologies.

Tomcat, also an Apache project, is a Java Servlet container that implements Java Servlets and JavaServer Pages. For many large companies it’s a mission-critical application. If you don’t know what a servlet container is, think of an outlet mall you shop at: one building contains many different stores. Tomcat, like an outlet mall, allows you to contain many different applications.

Now, the purpose of this little write up is to show you how to successfully install Apache Nutch and Tomcat on a Windows 7 machine, provided you pay attention and your current machine isn’t configured funky.

To start out, if you’re already running a WAMP stack, scroll down a few steps, because I’m going to take the easy way out and install XAMPP from ApacheFriends. I chose this route simply because it includes Tomcat in the installation, which speeds up the process a bit. Let’s look at what you need to get for all of this:

XAMPP from ApacheFriends

Java SE JDK 6 from Oracle

Cygwin

Apache Nutch 1.2

So get XAMPP and install it.

After XAMPP installs, go ahead and configure it as you normally would, setting passwords and all that jazz.

Next, go to Oracle and download the Java SE JDK 6 package for whatever architecture your Windows installation is (e.g. x64). Install the package. Now that you’ve installed the JDK, you can run and test Tomcat, which should have been installed along with XAMPP. Go to the command line and navigate to the directory where you installed XAMPP. Run the command catalina_start.bat. Don’t close the command window afterwards: if everything went fine, Tomcat keeps running in it.

Browse to localhost:8080, and you should see Tomcat’s main page. If you’re running a firewall, make sure to add an exception that allows Tomcat.

Tomcat Index Page

By default, Tomcat has no users enabled. You need to modify the file tomcat-users.xml to create a username and password and assign yourself the role of manager. You need this in order to access the /manager application of Tomcat. tomcat-users.xml is located in the Tomcat/conf directory.
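
As a rough sketch of what such an entry can look like, the Cygwin snippet below writes a sample file into the current directory so you can review it before copying its contents into Tomcat/conf/tomcat-users.xml. The file name tomcat-users.sample.xml, the username admin, and the password changeme are placeholders of mine, not values from this setup. The role is named manager as described here; Tomcat 7 and later name the role for the HTML manager manager-gui instead, so check which Tomcat version your XAMPP ships.

```shell
# Write a sample tomcat-users.xml to the current directory for review.
# "admin" and "changeme" are placeholder credentials; pick your own.
cat > tomcat-users.sample.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tomcat-users>
  <!-- Tomcat 6 role name; on Tomcat 7 or later use "manager-gui" instead -->
  <role rolename="manager"/>
  <user username="admin" password="changeme" roles="manager"/>
</tomcat-users>
EOF
```

Tomcat reads this file at startup, so restart Tomcat after changing the real tomcat-users.xml.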

 

Copy the above code into the tomcat-users.xml file, editing the appropriate places. (Note: if there are other entries between the <tomcat-users></tomcat-users> tags, you can comment them out or delete them.)

If you’ve already downloaded Nutch, go ahead and extract it somewhere easy; I put mine right in the XAMPP directory next to all my other services. Now you need to set an environment variable, and there are a couple of ways to do this. For this example I’m going to use Cygwin, but you can also set environment variables from the Windows command line via SET, or in the System settings of the Control Panel.

Install and open Cygwin. At the command prompt, change into the directory you extracted Nutch into.

Setting the environment variable is easy: we need to set JAVA_HOME to the path of the Java SE JDK.
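
In Cygwin that boils down to a single export. The sketch below assumes the JDK was installed under C:\Program Files\Java\jdk1.6.0_24, which is only an example path; your directory name will differ. It uses the short 8.3 name Progra~1 instead of "Program Files" because spaces in the path can trip up Cygwin, and it assumes Progra~1 is the short name on your system.

```shell
# Assumed JDK location; replace with the directory your JDK installed into.
# /cygdrive/c is how Cygwin exposes the C: drive.
export JAVA_HOME='/cygdrive/c/Progra~1/Java/jdk1.6.0_24'

# Confirm the variable is set for this shell session.
echo "JAVA_HOME is $JAVA_HOME"
```

The export only lasts for the current Cygwin session, so repeat it (or add it to your shell profile) when you open a new window.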

Next, create a directory called “urls” in your Nutch directory, then create a text file in that directory. You can name it anything you want.

In the text file, enter the domains you want to crawl, one per line.
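
From the Nutch directory in Cygwin, those two steps might look like this. The file name seed.txt is just a placeholder, and the URL is the example site used in this write up.

```shell
# Create the urls directory (no error if it already exists).
mkdir -p urls

# Write one seed URL per line; seed.txt is an arbitrary file name.
printf '%s\n' 'http://circuitbomb.com/' > urls/seed.txt

# Show what Nutch will read.
cat urls/seed.txt
```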

Open the crawl-urlfilter.txt file located in your Nutch/conf directory and edit it like so:

+^http://([a-z0-9]*\.)*circuitbomb.com/
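
To get a feel for what that pattern lets through, you can try it against a few URLs with grep before crawling. This is only a rough check, under the assumption that grep -E treats this particular pattern the same way Nutch’s regex filter does. The leading + marks an include rule in the filter file and is not part of the regex. The function name check_url and the sample URLs are made up for illustration.

```shell
# The filter pattern without the leading "+" (include marker).
pattern='^http://([a-z0-9]*\.)*circuitbomb.com/'

# Print whether a URL would pass the pattern.
check_url() {
  if printf '%s\n' "$1" | grep -Eq "$pattern"; then
    echo "accept $1"
  else
    echo "reject $1"
  fi
}

check_url 'http://circuitbomb.com/'
check_url 'http://www.circuitbomb.com/about'
check_url 'http://example.com/'
```

The first two URLs pass and the third does not, since the pattern accepts the domain itself plus any subdomain of it.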

Next, open the file nutch-site.xml located in your Nutch/conf directory. This file contains information about your crawler, and you need to set up some details for it to relay to the websites you crawl.

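  <property>
    <name>http.agent.name</name>
    <value>Pangalactic Gargleblaster</value>
    <description>Pangalactic Gargleblaster</description>
  </property>

  <property>
    <name>http.agent.description</name>
    <value>Very intoxicating</value>
    <description>Very intoxicating</description>
  </property>

  <property>
    <name>http.agent.url</name>
    <value>http://circuitbomb.com</value>
    <description>http://circuitbomb.com</description>
  </property>

  <property>
    <name>http.agent.email</name>
    <value>DigitalMail</value>
    <description>complaintdepartment@http://circuitbomb.com</description>
  </property>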

Now we get to run the crawler! Go back to Cygwin; you should still be in the Nutch directory. To run the crawler, we type in a command that tells Nutch what to crawl, where to store the crawled indexes, how deep to crawl, and the maximum number of pages to crawl at each level up to that depth. Here is an overview of some primary options:

  • -dir dir is the name of the directory to store the crawl index
  • -depth depth is the link depth from the root page to crawl
  • -topN topN is the maximum number of pages to crawl on each level up to the depth
  • -threads threads is the number of threads that will fetch in parallel

The command to run a basic crawl is: bin/nutch crawl urls -dir crawl -depth 3 -topN 50
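
Since -topN caps the pages fetched per level and -depth sets the number of levels, a crawl fetches at most depth times topN pages. The sketch below works that out for the basic command and prints the command line those values correspond to. The variable names are mine, and the actual crawl is left commented out so nothing starts by accident.

```shell
# Values from the basic crawl command.
depth=3
topN=50

# Upper bound on fetched pages: at most topN pages per level, depth levels.
max_pages=$((depth * topN))
echo "This crawl fetches at most $max_pages pages"

# The command line these values correspond to.
crawl_cmd="bin/nutch crawl urls -dir crawl -depth $depth -topN $topN"
echo "$crawl_cmd"

# Uncomment to run it from the Nutch directory:
# $crawl_cmd
```

Raising either number grows the crawl quickly, so small values are a sensible starting point for a first test run.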

If all went according to plan, you should see Nutch crawling the URLs you specified.

Awesome. Now we need a way to search the crawled pages, and this is really easy. Go to your Nutch directory in Explorer, copy the nutch-1.2.war file, and paste it into your Tomcat/webapps directory. You can also upload it through Tomcat’s manager interface if you’d like, but since we are on localhost I don’t see the point. Regardless, you still need to go into your Tomcat manager, refresh, and deploy Nutch. A new directory named after the .war file should have been created in the Tomcat/webapps directory. Open it, go to the WEB-INF directory, and then into classes. Open the nutch-site.xml file and add a property between the configuration tags to tell the web app where to find the crawled indexes.

    <property>
      <name>searcher.dir</name>
      <value>C:\xampp\Nutch\crawl\</value>
    </property>

    (Or use the path to the directory you specified to store the index.)
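
Before reloading, it can be worth checking from Cygwin that the directory searcher.dir points at really holds a finished crawl. The helper below is a sketch: check_crawl_dir is a name I made up, and the list of subfolders (crawldb, index, indexes, linkdb, segments) is what a Nutch 1.2 crawl is reported to produce in the comments on this post, so treat it as an assumption for other versions.

```shell
# Report which of the expected crawl subdirectories exist under $1.
# Returns 0 when all are present, 1 when at least one is missing.
check_crawl_dir() {
  crawl_dir=$1
  missing=0
  for sub in crawldb index indexes linkdb segments; do
    if [ -d "$crawl_dir/$sub" ]; then
      echo "found   $sub"
    else
      echo "missing $sub"
      missing=1
    fi
  done
  return "$missing"
}

# Example call with the path used here, as Cygwin sees it:
# check_crawl_dir /cygdrive/c/xampp/Nutch/crawl
```

If anything is reported missing, the search page will likely come back empty, and re-running the crawl is the simplest fix.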

Save the file and reload the Nutch application from the Tomcat manager. Afterwards, go to the Nutch application by clicking its link in the manager, and you should be greeted with Nutch Search.

Cool.


23 comments until now

  1. Do you need to install Apache “Ant” to run Nutch?

  2. No, you don’t need Apache Ant unless you build Nutch from source. The version being used here is the version 1.2 binary release.

  3. arif_wic @ 2011-05-17 01:58

    Have you ever used Nutch on Windows in real production?

  4. I have not used Nutch in a live production environment, only for personal use with Linux and Windows.

  5. Thanks for your answer! It helps give a better understanding to this newbie in building Nutch -Ad

  6. arif_wic @ 2011-05-18 01:57

    Thanks for your reply, and it’s running well for me. I just found one small difference: for the option -depth 3, I must create an empty folder named ‘3’; without it, Nutch complains that a path is missing.

  7. How can I change some code and regenerate the war file to deploy in Apache Tomcat?

  8. Hi Leo. If you’re just looking to edit the web front-end to Nutch, you should stop the application in Tomcat, edit the files you want, and then re-deploy it in Tomcat. Changes should be reflected afterwards.

  9. That’s interesting, arif, because the depth option should only signify the link depth you want to crawl. -dir should be the option for setting the directory where the crawl index/database is stored. Weird.

  10. http://zillionics.com/resources/articles/NutchGuideForDummies.htm
    [by following above]
    How do I push Nutch data into my Solr db to search it more conveniently? I tried many of the tricks and procedures, but none of them worked for me.
    For example, this command:
    bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments

    gives me errors like this:
    input path does not exist: file:/ c:/wamp/nutch-1.2/crawl/segments/crawl_fetch
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_parse
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_data
    input path does not exist: file:/c:/wamp/nutch-1.2/crawl/segments/crawl_text

    In the segments folder, folders like 20110606102332, 20110606102455, 20110606102814 are created, and all of the above files are in them.

    I also tried providing a single segment, like:
    bin/nutch solrindex http://127.0.0.1:8080/solr/ crawl/crawldb crawl/linkdb crawl/segments/20110606102332

    Then it gives me an error like this:
    java.io.IOException: Job failed!

  11. Followed your steps, but had some issues.
    Setting JAVA_HOME to equal the path to the Java SE JDK is going to pose some problems. Cygwin does not enjoy spaces in file names or file paths, so using “Program Files” as the default directory for the JDK causes Cygwin to interpret the space as a delimiter.

  12. Hi Troy,

    JAVA_HOME shouldn’t be set to “Program Files” alone but rather to the jre6 directory. Checking again using the method I described above did not generate any problems.

  13. Hi Sam,

    Sorry I can’t help you with your problem at the moment, I have yet to use Solr with Nutch.

  14. You need to replace “Program Files” with “Progra~1” (or whatever the DOS name is) if you have that as part of your JAVA_HOME path. For example, if your Java bin directory is “C:\Program Files\Java\jdk1.6.0_24\bin”, then your JAVA_HOME should be “C:\Progra~1\Java\jdk1.6.0_24” for Nutch to work in Cygwin.

  15. Well compiled tutorial for the initial startup. I have one issue. There is no WAR file in the nutch directory. I am using Nutch 1.3. I checked all the versions src / bin / tar.gz. None of them have the WAR file. Even the build.xml is not having “war” target in it and no war task inside the build.xml file. So, I am stuck at this point to execute a search on the crawldb. How can I get the war file?

  16. I really liked your explanation about using Nutch. I’m wondering if it is possible to use Nutch on JBoss instead of Apache.

    I’m starting to configure the environment and I’m not sure if this is going to be possible without Tomcat. OS Windows and Eclipse.

  17. Worthwhile tutorial for installing the Nutch search engine on Windows 7. Thanks a lot. I tried this, but when I enter any query in the Nutch search engine it displays “Hits 0-0 (out of about 0 total matching pages): ”. What can I do to get actual results? Please do reply.

  18. Hi, I am using Nutch 1.2 and I can successfully make it crawl, but there is no linkdb created. Is there a problem?

  19. Nguyen Manh Dung @ 2011-11-14 07:22

    @kelvin: do you have a crawl folder (it is usually created at …/nutch-1.2/crawl)? If so, check inside it: if the crawl went right, it should have 5 folders inside: crawldb, index, indexes, linkdb & segments. Or you may delete the old crawl folder and crawl again. Hope this helps!
    @Varsha: about the “Hits 0-0 (out of about 0 total matching pages)” message, I think it might be because you searched for a word which doesn’t appear in the domains you crawled. Try one more time with a word that you are sure appears in the pages you crawled. Hope this helps!

  20. Can I get an RSS feed using the Nutch crawler?

  21. blunderboy @ 2012-03-15 15:20

    Hi,
    Has anyone worked on Apache Nutch 1.4?
    I am facing an issue.
    After following the above mentioned instructions, I am getting an error
    $ bin/nutch crawl urls -dir crawl -depth 3 -topN 5
    cygpath: can’t convert empty path
    Error: Could not find or load main class org.apache.nutch.crawl.Crawl

    I have tried a lot but could not find the reason.
    Please tell me how to solve this error

    Thanks :)

  22. nutch newbie @ 2012-11-06 23:14

    I’m new to Nutch and Apache in general, so please forgive my ignorant questions. I ran the command catalina_start.bat. In the command window I see the various commands execute, but I’m getting some errors with Address Already in use: JVM_Bind, and a bit later I see a severe error:
    SEVERE: Failed to initialize connector [Connector[HTTP/1.1-8080]]

    Any words of advice? What can I do to get around this issue? Thanks in advance.

  23. Make sure you’re not already running a service on port 8080.
