Nutch - what I did differently

http://zillionics.com/resources/articles/NutchGuideForDummies.htm

3. Set up the crawler:
Create a directory called urls to hold a text file with URLs inside of it.
becomes:
Create a directory called seedlist to hold the text file with URLs inside of it.

In this directory, create the text file with any name you like. Put in any URLs, one per line. This is the crawler’s “shopping list”.

becomes:
c:\nutch-1.0\seedlist\seedlist.txt = http://www.neocodesoftware.com/
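
For anyone doing this from a shell rather than by hand, here is a minimal sketch of the same step. It assumes a Unix-style shell (for example Cygwin on Windows) with the Nutch install directory as the current directory. The variable names SEED_DIR and SEED_FILE are just for illustration.

```shell
#!/bin/sh
# Sketch: build the seed list. Run from the Nutch install directory
# (c:\nutch-1.0 in this post).
set -e

SEED_DIR=seedlist                     # directory later passed to "bin/nutch crawl"
SEED_FILE="$SEED_DIR/seedlist.txt"    # any file name works

mkdir -p "$SEED_DIR"

# One URL per line; add more lines to give the crawler more starting points.
cat > "$SEED_FILE" <<'EOF'
http://www.neocodesoftware.com/
EOF

# Show what the crawler will start from.
cat "$SEED_FILE"
```

The heredoc keeps the one-URL-per-line layout visible, so adding more seeds is just a matter of adding lines.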

4. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with your own domain (it has to match the domain in the seed list), so the line becomes:
+^http://([a-z0-9]*\.)*neocodesoftware.com/
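
Since a filter that does not match the seed URL leaves the crawler with nothing to fetch, it can be worth checking the rule against the seed before crawling. The sketch below is only an approximation: it uses grep -E in place of Nutch's Java regex matching (the two should agree for a simple pattern like this one), the pattern is written for the seed's neocodesoftware.com domain, and url_accepted is a hypothetical helper, not part of Nutch.

```shell
#!/bin/sh
# Sketch: check URLs against an accept rule from conf/crawl-urlfilter.txt.
# Assumption: grep -E stands in for Nutch's Java regex engine.

# The rule without its leading "+" (the "+" only marks it as an accept rule).
PATTERN='^http://([a-z0-9]*\.)*neocodesoftware.com/'

# Hypothetical helper: succeeds if the URL matches the accept pattern.
url_accepted() {
    printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

# Try the seed URL and one URL from an unrelated domain.
for url in 'http://www.neocodesoftware.com/' 'http://www.example.org/'; do
    if url_accepted "$url"; then
        echo "accepted: $url"
    else
        echo "rejected: $url"
    fi
done
```

Swap in whatever pattern and seed URLs you actually use; if the seed comes back rejected, fix the filter before running the crawl.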

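Edit the file conf/nutch-site.xml. Insert at least the following properties into it and fill in proper values for them:

<property>
  <name>http.agent.name</name>
  <value>neocode</value>
</property>

<property>
  <name>http.agent.description</name>
  <value>neocodeagent</value>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://www.neocodesoftware.com</value>
</property>

<property>
  <name>http.agent.email</name>
  <value>sales@neocodesoftware.com</value>
</property>

<property>
  <name>searcher.dir</name>
  <value>crawl</value>
</property>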

So the crawl command
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
becomes
bin/nutch crawl seedlist -dir crawl -depth 3 -topN 50

Plus I made this to clear out the old crawl (stop Tomcat, delete the crawl directory, start Tomcat again):
cd \nutch
net stop "apache tomcat 6"
rmdir /s /q crawl
net start "apache tomcat 6"
