The Apache Software Foundation is friggin awesome. They come up with some really great things; more so, they have an excellent list of open source projects that have had a huge impact everywhere. Apache Web Server is one of the most widely used servers out there, coughing up HTML/CSS/PHP to the masses of the World Wide Web.
Apache Nutch is an open source web crawler that builds on Apache Solr and Lucene technologies.
Tomcat, also an Apache project, is a Java Servlet container implementing Java Servlets and JavaServer Pages. For many large companies it’s a mission-critical application. If you don’t know what a servlet container is, think of an outlet mall you shop at. One building contains many different stores. Tomcat, like an outlet mall, allows you to contain many different applications.
Now, the purpose of this little write-up is to show you how to successfully install Apache Nutch and Tomcat on a Windows 7 machine, provided you pay attention and your current machine isn’t configured funky.
To start out, if you’re already running a WAMP stack, scroll down a few steps, because I’m going to take the easy way out and install XAMPP from ApacheFriends. I chose this route simply because it includes Tomcat in the installation, which speeds up the process a bit. Let’s look at what you need to get for all of this:
- XAMPP from ApacheFriends
- Java SE JDK 6 from Oracle
- Cygwin
- Apache Nutch 1.2
So, get XAMPP and run the installer.
After XAMPP installs, go ahead and configure it as you normally would, setting passwords and all that jazz.
Next, go to Oracle and download the Java SE JDK 6 package for whatever architecture your Windows installation is (e.g. x64). Install the package. Now that you’ve installed the JDK, you can run and test Tomcat, which should have been installed along with XAMPP. Go to the command line and navigate to the directory where you installed XAMPP. Run the command catalina_start.bat. Don’t close the command window afterwards: if everything went fine, Tomcat is going to run in it.
Browse to localhost:8080, and you should see Tomcat’s main page. If you’re running a firewall, make sure to add an exception to allow Tomcat.
By default, Tomcat has no users enabled. You need to modify the file tomcat-users.xml to create a username and password and assign yourself the role of manager. You need this in order to access the /manager application of Tomcat. tomcat-users.xml is located in the Tomcat/conf directory.
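Here is a minimal sketch of what such a tomcat-users.xml could look like, written out from the shell with a heredoc. The TOMCAT_CONF location, the username "admin" and the password "changeme" are placeholders I made up for illustration, so swap in your own values and your real Tomcat/conf path.

```shell
# Sketch only: writes a minimal tomcat-users.xml into a conf directory.
# TOMCAT_CONF, "admin" and "changeme" are placeholders (assumptions), not values from this write-up.
TOMCAT_CONF="${TOMCAT_CONF:-./tomcat/conf}"
mkdir -p "$TOMCAT_CONF"

# Keep a backup of any existing file before overwriting it.
if [ -f "$TOMCAT_CONF/tomcat-users.xml" ]; then
  cp "$TOMCAT_CONF/tomcat-users.xml" "$TOMCAT_CONF/tomcat-users.xml.bak"
fi

# One role (manager) and one user that holds it.
cat > "$TOMCAT_CONF/tomcat-users.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<tomcat-users>
  <role rolename="manager"/>
  <user username="admin" password="changeme" roles="manager"/>
</tomcat-users>
EOF
```

One thing to watch: on Tomcat 7 and later the manager web interface expects the role manager-gui rather than manager, so if the login is refused, check which Tomcat version your XAMPP shipped with and adjust the role name.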
Copy the above code and place it in the tomcat-users.xml file, editing the appropriate places. (Note: If there are other entries between the <tomcat-users></tomcat-users> tags, you can comment them out or delete them.)
If you’ve already downloaded Nutch, go ahead and extract it somewhere easy; I put mine right in the XAMPP directory next to all my other services. Now you need to set an environment variable. There are a couple of ways to do this: for this example I’m going to use Cygwin, but you can also set environment variables from the Windows command line via SET, or in your Control Panel system settings.
Install and open Cygwin. At the command prompt, change into the directory you extracted Nutch into.
Setting the environment variable is easy: we need to set JAVA_HOME to the path of the Java SE JDK.
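In Cygwin that could look something like the sketch below. The JDK path is an assumption on my part (a typical default install location, and your folder will most likely carry an update suffix), so point it at wherever the installer actually put the JDK on your machine.

```shell
# Assumed install location; adjust to wherever the JDK installer put Java SE 6 on your machine.
# Cygwin exposes the C: drive as /cygdrive/c.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk1.6.0"

# Sanity check: warn if the path does not actually contain a java binary.
if [ -x "$JAVA_HOME/bin/java" ] || [ -x "$JAVA_HOME/bin/java.exe" ]; then
  echo "JAVA_HOME looks good: $JAVA_HOME"
else
  echo "Warning: no java found under $JAVA_HOME" >&2
fi
```

The export only lasts for the current Cygwin session, which is all we need for running the crawl.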
Next, create a directory called “urls” in your Nutch directory, then create a text file in that directory; you can name it anything you want.
In the text file, enter the URLs of the domains you want to crawl, one per line.
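From the Cygwin prompt, those two steps can be sketched as follows. The file name seed.txt and the example URL are my own picks for illustration; use whatever name and sites you like.

```shell
# Run from the Nutch directory. seed.txt is an arbitrary file name; the URL is just an example.
mkdir -p urls

# One URL per line; add more arguments to printf to seed more sites.
printf '%s\n' 'http://nutch.apache.org/' > urls/seed.txt

# Show what the crawler will start from.
cat urls/seed.txt
```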
Open the crawl-urlfilter text file located in your Nutch/conf directory and edit it like so:
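As a rough sketch of that edit done from the shell: this assumes your copy of conf/crawl-urlfilter.txt still has the stock placeholder line containing MY.DOMAIN.NAME, which is how I remember the Nutch 1.x default file. The helper name allow_domain and the domain apache.org are just examples, not anything Nutch provides.

```shell
# allow_domain FILE DOMAIN
# Replaces the MY.DOMAIN.NAME placeholder in FILE with DOMAIN (hypothetical helper).
allow_domain() {
  filter_file="$1"
  domain="$2"
  # Write to a temp file first, then swap it in, so a failed sed never truncates the original.
  sed "s/MY\.DOMAIN\.NAME/$domain/" "$filter_file" > "$filter_file.tmp" &&
    mv "$filter_file.tmp" "$filter_file"
}

# Only touch the file when it is actually there (i.e. when run from the Nutch directory).
if [ -f conf/crawl-urlfilter.txt ]; then
  allow_domain conf/crawl-urlfilter.txt apache.org
fi
```

If the placeholder is not in your file, just open it in an editor and add an accept line for your domain by hand.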
Next, open the file nutch-site.xml located in your Nutch/conf directory. This file contains information about your crawler, and you need to set up some details for it to relay to the websites you crawl.
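<property>
  <name>http.agent.name</name>
  <value>Pangalactic Gargleblaster</value>
  <description>Pangalactic Gargleblaster</description>
</property>
<property>
  <name>http.agent.description</name>
  <value>Very intoxicating</value>
  <description>Very intoxicating</description>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://circuitbomb.com</value>
  <description>http://circuitbomb.com</description>
</property>
<property>
  <name>http.agent.email</name>
  <value>DigitalMail</value>
  <description>complaintdepartment@http://circuitbomb.com</description>
</property>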
Now we get to run the crawler! Go back to Cygwin; you should still be in the Nutch directory. To run the crawler, we type in a command that tells Nutch what to crawl, where to store the crawled indexes, how deep to crawl, and the maximum number of pages to crawl at each level. Here is an overview of the primary options:
- -dir <dir>: the name of the directory to store the crawl index
- -depth <depth>: the link depth from the root page to crawl
- -topN <topN>: the maximum number of pages to crawl on each level up to the depth
- -threads <threads>: the number of threads that will fetch in parallel
The command to run a basic crawl is: bin/nutch crawl urls -dir crawl -depth 3 -topN 50
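If you want to tweak the options without retyping the whole thing, here is a small sketch that builds the same command from variables. The variable names are mine, and the -threads value of 10 is an arbitrary example rather than a recommendation.

```shell
# Crawl settings; the first four mirror the basic command, THREADS=10 is an illustrative assumption.
SEED_DIR="urls"
CRAWL_DIR="crawl"
DEPTH=3
TOP_N=50
THREADS=10

# Assemble the command line from the settings above.
CRAWL_CMD="bin/nutch crawl $SEED_DIR -dir $CRAWL_DIR -depth $DEPTH -topN $TOP_N -threads $THREADS"

if [ -x bin/nutch ]; then
  # Intentionally unquoted so the shell splits the string into separate arguments.
  $CRAWL_CMD
else
  echo "bin/nutch not found; run this from the Nutch directory. Would have run:"
  echo "$CRAWL_CMD"
fi
```

Keeping the values small while you test makes it quicker to see whether the crawl works at all before letting it loose on more pages.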
If all went according to plan, Nutch will start crawling the URLs you specified and log its progress in the Cygwin window.
Awesome. Now we need a way to search the crawled pages, and this is really easy. Go to your Nutch directory in Explorer, copy the nutch-1.2.war file, and paste it into your tomcat/webapps directory. You can also upload it through the manager interface of Tomcat if you’d like, but since we are on localhost I don’t see the point. Regardless, you still need to go into your Tomcat manager, refresh, and deploy Nutch. A new directory named after the .war file should have been created in the Tomcat/webapps directory. Open it, go to the WEB-INF directory, and then into classes. Open the nutch-site.xml file and add a property between the configuration tags to tell the web app where to find the crawled indexes.
<property>
  <name>searcher.dir</name>
  <value>C:\xampp\Nutch\crawl\</value>
</property>
(or use the path to the directory you specified to store the index)
Save the file and reload the Nutch application from the Tomcat manager. Afterwards, go to the Nutch application by clicking its link in the manager, and you should be greeted with Nutch Search.