Sharehound is an open-source network file system indexer and Google-like searcher written in Java and based on Apache Lucene. Currently supported are SMB file shares (i.e. MS Windows-based network shares) and FTP resources. A Web UI is used for search and crawl monitoring. Japanese (日本語) is supported! :)
Files and directories are indexed and searched by their paths and some other external attributes, not by their contents. Historical data (deleted files and offline host info) is also indexed, which allows full-fledged RSS notifications about added, changed and deleted files in a user's search results.
The latest release can be downloaded here. I see the following possible uses of this program:
* File search across Windows LAN shares, e.g. for files in a corporate local network. A Sharehound server can be deployed in a corporate network to lower inbound external traffic: people would first search the local net for the distributions they need and download them locally if found. As a variation, this could be implemented as an extension for a corporate proxy server, but I don't think this is really necessary: people would do so themselves, as a local download is usually faster.
* Web interface for listing a remote computer's file system (e.g. to browse the file system of a home computer from a remote office)
* RSS notification about files of interest, e.g. new music on a given computer, a set of computers or the whole net
* FTP server search and change notification companion. I think this can be interesting for FTP site owners.
* RSS notification for updates in a version control system. There is no full support for VCSs yet, but you can just point Sharehound to a (shared) file system snapshot of a repository (Subversion, CVS, Perforce - whatever) and get an RSS notification when the set of files or their attributes (date, size) in a given dir (and, optionally, in its subdirs) changes.
Some docs:
* Installation instructions
* Some facts about the application usage problems and their solutions, plus optional config parameters
* Technical description of the project
* changelog
* Build instructions for developers
* license information
If you've tried Sharehound and found a bug or want a new feature, please report it on the project users' mailing list at Sourceforge. You can also submit bug reports and feature requests, but please read the existing items before posting a new one so that you don't repeat them.
If you're a developer and would like to contribute to this project, write to me about it, or just write the code and send a patch for a start. It is better to talk with me first so that we make sure our efforts don't overlap.
Sharehound: network file systems index and search software.
Purpose
The aim of the project is to create an extensible indexing and searching engine for hierarchical network-accessed file storage such as LAN shares, FTP servers and version control systems. Files and directories are indexed and searched by their external attributes only, i.e. path, name, size, modification date. Only paths are currently used for the search itself; the rest of the attributes are used for sorting the results. Sharehound doesn't currently search files by their contents and/or internal metadata, though that's not impossible to implement. :)
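To illustrate the split between "searched" and "sort-only" attributes, here is a minimal sketch in plain Java. It is not Sharehound's code: the real search goes through Lucene, while this sketch uses a simple case-insensitive substring match, and the class and field names (PathSearchSketch, Entry) are made up for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration only; Sharehound itself searches through Lucene. */
class PathSearchSketch {

    /** External attributes of one indexed file. */
    static final class Entry {
        final String path;
        final long size;
        final long modified;

        Entry(String path, long size, long modified) {
            this.path = path;
            this.size = size;
            this.modified = modified;
        }
    }

    /** Matches on the path only, then sorts the hits by size, largest first. */
    static List<Entry> search(List<Entry> index, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        List<Entry> hits = new ArrayList<Entry>();
        for (Entry e : index) {
            if (e.path.toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(e);
            }
        }
        Collections.sort(hits, new Comparator<Entry>() {
            public int compare(Entry a, Entry b) {
                return Long.compare(b.size, a.size);
            }
        });
        return hits;
    }

    public static void main(String[] args) {
        List<Entry> index = new ArrayList<Entry>();
        index.add(new Entry("music/Rock/a.mp3", 10, 1L));
        index.add(new Entry("docs/readme.txt", 5, 2L));
        index.add(new Entry("music/rock/b.mp3", 30, 3L));
        for (Entry e : search(index, "rock")) {
            System.out.println(e.path + " (" + e.size + " bytes)");
        }
    }
}
```

Size and modification date never take part in matching here; they only decide the order of the hits.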
The project is currently under development.
Implementation.
Language:
The system is written in Java (JDK 1.5) using the following libraries:
* Samba jCIFS for accessing MS Windows-based network file shares
* Quartz scheduling library
* Apache Lucene for storing, indexing and searching files info
OS platforms:
any with JDK 1.5 available
Features.
Indexer/crawler
Currently supported indexed resources are MS Windows-based network file shares.
The system includes a crawling/indexing component with a crawl admin Web UI, and a searching Web UI. An RSS interface to the search results is available.
The current output of the crawling/indexing component is a Lucene index containing a trackable file list (i.e. it records when and which files were added, deleted or modified, and which files currently cannot be accessed).
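The idea of a trackable file list can be sketched as a comparison of two crawl snapshots. This is only an illustration under my own assumptions, not how Sharehound stores or computes changes: the names SnapshotDiff and Attrs are hypothetical, and the "currently cannot be accessed" state is left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration; not Sharehound's actual classes. */
class SnapshotDiff {

    /** External attributes of one file: size in bytes and last-modified time. */
    static final class Attrs {
        final long size;
        final long modified;

        Attrs(long size, long modified) {
            this.size = size;
            this.modified = modified;
        }

        boolean sameAs(Attrs other) {
            return size == other.size && modified == other.modified;
        }
    }

    /** Returns path -> "added", "deleted" or "modified" for every path that changed. */
    static Map<String, String> diff(Map<String, Attrs> previous, Map<String, Attrs> current) {
        Map<String, String> changes = new TreeMap<String, String>();
        for (Map.Entry<String, Attrs> e : current.entrySet()) {
            Attrs old = previous.get(e.getKey());
            if (old == null) {
                changes.put(e.getKey(), "added");
            } else if (!old.sameAs(e.getValue())) {
                changes.put(e.getKey(), "modified");
            }
        }
        for (String path : previous.keySet()) {
            if (!current.containsKey(path)) {
                changes.put(path, "deleted");
            }
        }
        return changes;
    }

    public static void main(String[] args) {
        Map<String, Attrs> before = new HashMap<String, Attrs>();
        before.put("music/a.mp3", new Attrs(100, 1L));
        before.put("music/b.mp3", new Attrs(200, 1L));
        Map<String, Attrs> after = new HashMap<String, Attrs>();
        after.put("music/b.mp3", new Attrs(250, 2L));
        after.put("music/c.mp3", new Attrs(300, 2L));
        System.out.println(diff(before, after));
    }
}
```

Keeping the "deleted" entries instead of dropping them is what makes notifications about removed files possible.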
The crawling component can also output data to a relational database through Hibernate (a new HibernateGateway() just has to be given to the corresponding task; the code would be similar to that of LucenedIpRangeSmbCrawlQuartzJob). The relational database is chosen in the Hibernate config file and can thus be selected by the user. Currently MySQL and PostgreSQL connection settings and SQL scripts are included.
Initially I thought that a relational database would be used as the primary file info storage and would then be indexed by Lucene or something similar. Later I abandoned this idea because relational databases with around 3 million records proved unmaintainable on the modest machines I use for development, and Lucene covers my storage needs satisfactorily. The Hibernate indexing branch is a bit out of date now and will probably fall further behind.
File access protocol implementation. CIFS is currently supported via Samba jCIFS. Implementations of other file access protocols (I'm now thinking of FTP and version control systems) are pluggable.
Parameters and scheduling of crawl jobs are configurable through an XML file (Quartz's job configuration).
Searcher
Searcher is implemented as a Web UI. Search can be narrowed to one directory and (optionally) its subdirectories. Common Lucene rules for searching apply. In particular, wildcard queries are supported, but a query must not begin with a wildcard. Only the relative file path is searched; the host name is out of scope. You can subscribe to the search results through an RSS feed (the square radar image under the results list). Note that what you see in the results list is what you get in the RSS feed; in particular, sorting (by default it's by search result rating) and page size matter. All the search parameters, including sorting and page size, are present in the URL and can be changed there for both the Web and RSS presentation.
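The wildcard rule can be checked before a query is ever sent to the searcher. The sketch below is plain Java, not Sharehound's or Lucene's code. Applying the rule to every whitespace-separated term, rather than only to the start of the whole query, is my assumption, and the class name QueryCheck is hypothetical.

```java
/** Hypothetical pre-check for the "no leading wildcard" rule. */
class QueryCheck {

    /** True if the query is non-empty and no term starts with '*' or '?'. */
    static boolean isAcceptable(String query) {
        if (query == null || query.trim().isEmpty()) {
            return false;
        }
        for (String term : query.trim().split("\\s+")) {
            char first = term.charAt(0);
            if (first == '*' || first == '?') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = { "report*.doc", "*.mp3", "music ?ock" };
        for (String q : samples) {
            System.out.println(q + " -> " + (isAcceptable(q) ? "ok" : "rejected"));
        }
    }
}
```

A query such as "report*.doc" passes, while "*.mp3" is rejected because its first character is a wildcard.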
Sharehound installation instructions.
This document describes how to deploy and set up sharehound under the Apache Tomcat web container. I use Tomcat 5.5; setup for other versions and containers is probably similar, but I'm not sure. This document describes only those configuration parameters that must be set at install time. Additional optional config parameters are listed in the "somefacts.html" file. These instructions assume you use the binary distribution zip, i.e. sharehound-[version].zip.
1. Steps to start sharehound:
* - Get a Tomcat distribution somewhere and set it up. You can refer to the Tomcat docs, but just unzipping it to some dir will possibly do. You will also need JDK 1.5.x to run Tomcat, or a JRE for later Tomcat versions - check your Tomcat docs. If you plan to touch the code you'll surely need the JDK.
* - make the necessary changes to sharehound's config files (see below for details)
* - copy the "sharehound" dir to ${tomcat-dir}/webapps. Or, to make sharehound the root application (so that it is accessed at http://yourserver:8080/), copy the contents of the "sharehound" dir to the ${tomcat-dir}/webapps/ROOT directory (and remove everything from there first!)
* - create a "logs" dir under ${tomcat-start-dir} (if it doesn't exist yet). Usually ${tomcat-start-dir} is ${CATALINA_HOME}, but I prefer to start Tomcat from ${tomcat-root} - I created a facade run.bat there (see below) and have all Tomcat logs in one dir.
* - you might need to tune JVM memory properties, or you can get a Java "OutOfMemory" error. I use "-Xmx128m" at home (450M in index, 3 computers) and "-Xmx600m" at work (10G in index, about 300 computers). This can be set in a number of places, e.g. in a newly created run.bat like this, placed in the Tomcat root directory:
-- begin run.bat --
set JAVA_OPTS=-Xmx600m
bin\catalina run
-- end run.bat --
* - run Tomcat. Crawl jobs will start according to their configuration (see quartz-config.xml section below).
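To check that the -Xmx value from the steps above actually reaches the JVM, a tiny standalone program like the one below can be run with the same options. It is only a sketch; the class name HeapCheck is made up, and it uses nothing but the standard Runtime API.

```java
/** Prints the maximum heap the current JVM will try to use. */
class HeapCheck {

    /** Maximum heap size in megabytes, as reported by the JVM. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap available to this JVM: " + maxHeapMegabytes() + " MB");
    }
}
```

Running it as "java -Xmx600m HeapCheck" should report a value close to 600; the exact number depends on the JVM.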
2. Required configuration. The application configuration files in question are located in the sharehound/WEB-INF/classes directory.
2.1 The lucene.properties file specifies where to place files generated by the crawl. The index.directory property specifies where to place Lucene indexes (they can be pretty big: on my work LAN there are 3 million indexed files, which take about 2G of index files); the filesequences.directory property specifies where to place ID sequence files (they are small - a pair of 8-byte files).
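As a sketch of what such a file might contain, the snippet below parses a sample lucene.properties text with the standard java.util.Properties class. The two property names come from the description above; the directory values are invented for the example, and the class name LucenePropertiesSketch is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Parses a sample lucene.properties text; the paths are made-up examples. */
class LucenePropertiesSketch {

    static final String SAMPLE =
            "index.directory=/var/sharehound/index\n"
          + "filesequences.directory=/var/sharehound/sequences\n";

    /** Loads properties from text, wrapping the (unlikely) I/O error. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException("Cannot read properties", e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(SAMPLE);
        System.out.println("Index goes to:     " + props.getProperty("index.directory"));
        System.out.println("Sequences go to:   " + props.getProperty("filesequences.directory"));
    }
}
```

Forward slashes are used in the sample paths because a backslash is an escape character in .properties files and would have to be doubled.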
2.3 The quartz-config.xml file describes crawl jobs. Take the one present in the sharehound/WEB-INF/classes directory and modify it as you need. Other config samples are present in the sharehound/WEB-INF/classes/quartz-config-examples directory. The file is a job configuration file for Quartz (an open-source Java scheduling library); its format is presumably described on the Quartz web site.
Important elements:
* job-class. It can contain "org.sourceforge.sharehound.tasks.quartz.SingleRootCrawlJob" or "org.sourceforge.sharehound.tasks.quartz.IpRangeCrawlQuartzJob".
* job-data-map specifies job parameters. Some are common to all jobs: the "crawl-latency" parameter is used to slow down the crawler so that it uses fewer CPU cycles (it's a rather hungry beast); the "account-properties" parameter specifies the *.account file that contains account information for the connection; "protocol" can be 'smb' or 'ftp' (without quotes).
SingleRootCrawlJob-specific job-data-map parameters: the "search-root" parameter gives the starting host to index, e.g. smb://192.168.1.198. The "smb://" or "ftp://" protocol prefix is mandatory.
IpRangeCrawlQuartzJob-specific job-data-map parameters:
o "from-ip" - starting IP of a range to index
o "to-ip" - ending IP of a range to index
o "exclude" - host to exclude from indexing. I use this at work so that the crawler does not index my own machine's hidden admin shares like "C$" - I use my domain account to crawl through the LAN. Note that you should use the "smb://" prefix here - this is a current inconvenience, and I'm always surprised to rediscover it.
o "threads-number" - number of threads to use for indexing shares from the given IP range. One thread will be used for each host. Note that when all the threads are crawling through online hosts they can use a lot of CPU (see the "crawl-latency" parameter and the crawl admin UI).
SMB-specific job-data-map parameters: filter-class - can be "org.sourceforge.sharehound.net.smb.HiddenSmbSharesFilterImpl". This will exclude hidden shares and files from being indexed (they are indexed by default).
* The trigger element specifies job scheduling. The name subelement should probably be unique; job-name must refer to the name under job-detail of the same job; repeat-count specifies the number of job repeats (-1 means endless repetition; 0 means "start once"); repeat-interval specifies the interval between job starts in milliseconds.
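The repeat-count and repeat-interval semantics described in the trigger element can be sketched without Quartz at all. The code below is my own illustration of those rules (a job starts once, then repeats repeat-count more times, or forever for -1); it does not call the Quartz API, and the class name TriggerSchedule is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of repeat-count / repeat-interval; not the Quartz API. */
class TriggerSchedule {

    /**
     * Returns up to 'limit' job start times in milliseconds.
     * repeatCount 0 means a single start, -1 means repeat without end.
     */
    static List<Long> fireTimes(long startMillis, int repeatCount,
                                long repeatIntervalMillis, int limit) {
        List<Long> times = new ArrayList<Long>();
        for (int i = 0; i < limit; i++) {
            if (repeatCount >= 0 && i > repeatCount) {
                break; // first start plus repeatCount repeats are done
            }
            times.add(startMillis + i * repeatIntervalMillis);
        }
        return times;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        System.out.println("start once:  " + fireTimes(0L, 0, hour, 5));
        System.out.println("two repeats: " + fireTimes(0L, 2, hour, 5));
        System.out.println("endless:     " + fireTimes(0L, -1, hour, 5));
    }
}
```

Note that the interval is measured between job starts, so a crawl that takes longer than the interval deserves a larger repeat-interval.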
Be sure not to schedule two jobs so that they index the same host at the same time, or the index will get duplicates (i.e. become corrupted, in our case). There are currently no internal checks for this.
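Since there is no built-in check, the ranges of two IP range jobs can be compared by hand or with a small helper. The sketch below is plain Java and not part of Sharehound: it expands a "from-ip".."to-ip" range into host URLs, skips one excluded host, and tells whether two ranges share any address. The class name IpRangeSketch is hypothetical, and it assumes IPv4 addresses and an "exclude" value given as an IP with the "smb://" prefix.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for IPv4 ranges; not Sharehound's own code. */
class IpRangeSketch {

    /** Converts a dotted IPv4 address to a number so that ranges can be compared. */
    static long toLong(String ip) {
        String[] parts = ip.trim().split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Converts the number back to dotted notation. */
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
                + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    /** Lists smb:// hosts in the inclusive range, skipping the excluded one (may be null). */
    static List<String> hosts(String fromIp, String toIp, String exclude) {
        String excludedIp = exclude == null ? null : exclude.replaceFirst("^smb://", "");
        long last = toLong(toIp);
        List<String> result = new ArrayList<String>();
        for (long ip = toLong(fromIp); ip <= last; ip++) {
            String host = toDotted(ip);
            if (!host.equals(excludedIp)) {
                result.add("smb://" + host);
            }
        }
        return result;
    }

    /** True if the two inclusive ranges have at least one address in common. */
    static boolean overlap(String fromA, String toA, String fromB, String toB) {
        return toLong(fromA) <= toLong(toB) && toLong(fromB) <= toLong(toA);
    }

    public static void main(String[] args) {
        System.out.println(hosts("192.168.1.1", "192.168.1.4", "smb://192.168.1.2"));
        System.out.println(overlap("192.168.1.1", "192.168.1.100", "192.168.1.50", "192.168.1.200"));
    }
}
```

If overlap returns true for two jobs, either their ranges or their schedules should be changed so that they never crawl the same host at the same time.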
2.4 Crawl admin UI.
It's available at http://yourserver:${tomcat-port}/sharehound/admin.do. It lets you set the crawl latency for jobs that are running and for jobs that will be run, so you can adjust the CPU utilization of sharehound's crawlers at runtime.
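How a latency value that can be changed at runtime throttles a crawl loop can be sketched as follows. This is not Sharehound's crawler: the class name ThrottledCrawl is made up, the "work" is a placeholder, and treating the latency as a pause in milliseconds after each visited item is my assumption.

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical crawl loop whose pause can be changed while it is running. */
class ThrottledCrawl {

    private final AtomicLong latencyMillis;

    ThrottledCrawl(long initialLatencyMillis) {
        this.latencyMillis = new AtomicLong(initialLatencyMillis);
    }

    /** May be called from another thread, e.g. by an admin page handler. */
    void setLatencyMillis(long value) {
        latencyMillis.set(value);
    }

    /** Visits every path, sleeping for the current latency after each one. */
    int crawl(Iterable<String> paths) {
        int visited = 0;
        for (String path : paths) {
            visited++; // placeholder for listing and indexing the path
            long pause = latencyMillis.get(); // re-read so runtime changes apply
            if (pause > 0) {
                try {
                    Thread.sleep(pause);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break; // stop crawling when asked to shut down
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        ThrottledCrawl crawl = new ThrottledCrawl(10L);
        int visited = crawl.crawl(Arrays.asList("share/a", "share/b", "share/c"));
        System.out.println("Visited " + visited + " paths");
    }
}
```

A larger latency means fewer items per second and less CPU load, at the cost of a longer crawl.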
The admin page is protected by the web container's security. You have to be recognized by the web container as having the "admin" role to access the admin page. To set this up in Tomcat, edit ${tomcat-dir}/conf/tomcat-users.xml, define the "admin" role there and add users to it.
2.5 Search UI.
It's the app's default page, available at http://yourserver:${tomcat-port}/sharehound/. It has the usual search UI features such as paging and sorting.
For Unicode queries to work you should tune Tomcat a bit: in ${tomcat-dir}/conf/server.xml add the useBodyEncodingForURI="true" attribute to the HTTP/1.1 Connector element. This change will take effect after a Tomcat restart.
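For reference, this is roughly what a Unicode query looks like once it is placed in a URL. The sketch uses only the standard java.net.URLEncoder class; the base URL and the "q" parameter name are assumptions for the example and not necessarily what Sharehound's search page uses.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Builds a search URL with a UTF-8 encoded query; "q" is a hypothetical parameter name. */
class SearchUrl {

    /** Percent-encodes the query as UTF-8. */
    static String encode(String query) {
        try {
            return URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    static String searchUrl(String base, String query) {
        return base + "?q=" + encode(query);
    }

    public static void main(String[] args) {
        // "\u65e5\u672c\u8a9e" is the word for "Japanese language" in Japanese
        System.out.println(searchUrl("http://yourserver:8080/sharehound/", "\u65e5\u672c\u8a9e"));
    }
}
```

The server has to decode these bytes as UTF-8 too; a mismatch between the two sides is the usual reason non-ASCII queries return nothing.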