duff
Joined: 26 Mar 2007
Posts: 153
|
here is the proposal BEP for bittorrent.org.
I need feedback:
BEP: XXX
Title: Distributed Search Engine
Version: $Revision$
Last-Modified: $Date$
Author: Olivier Dufour <olivier(dot)duff(at)gmail(dot)com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 02-May-2009
Post-History:
Introduction
============
The goal of this BEP is to provide a search engine in the bittorrent client.
We have ever this in big tracker site. And some client use it with rss but it is far from perfect.
Use the DHT
===========
In order to provide a distributed search engine we must use the current DHT which the only network not link to torrent hash.
So this proposal will be done into the DHT.
DHT have a unique id whereas here we can have few word to describe a single file.
My idea is to make an hashcode from each word and distribute it over a second hashtable with dht. (The first one come from BEP 5)
We will need some new packets.
we will have to restrict the minimal search on 3 characters and we do substring to spread keyword which can be a subpart of the original keyword.
//TODO use token?
Broadcast File Owning
=====================
"q" == "own_file" //TODO or announce_file
- "keyword" = keyword hashcode
- "name" = name of file
- "infohash" = infohash
- "category"(optional) = Video, Audio, Image, or Data
send "get_peers" request to get closer node until find the closer node to hashcode of keyword
then send the broadcast of own_file.
//maybe have to use another message than get_peers new message???
When receiving this message the node have to store in an hashtable with key equal to keyword hashcode and value equal
to a list of structure which contain name, infohash, and category.
This structure can evolve for need if some optional data is add the own_file request.
Search File
===========
"q" == "search_file"
A search File query have x arguments:
- "search" = keyword hashcode
- "filter" = filter (other keyword of the search sentence)
- "category"(optional) = Video, Audio, Image, or Data
if the node who receive this packet do not have a dictionnary
over the search word it respond with closer note ( with xor distance) to hash of keyword.
Else it return the search response:
"q" == "search_responce"
"dict" = dictionnary of name + infohash + category
the dictionary is build with the list of value found for the key in hashtable filter with category and other keywords of search sentence.
Category
========
we have to cut it in category too.
we can do it with extension:
zip rar deb rpm, ... =>data
mov, avi, mpeg, divx ..=> videos
mp3, wma, ogg vorbis ..=> audios
TODO : Image
Issue, ogg can be audio or video ;(((
anyway we can filter on extention with category too ;)
HashCode
========
hashcode must be done from the search word.
This hashcode must not been the .NET hashcode.
issue: what special character of some language:
germany : ß
France: é è à ç ô ö...
I think we must avoid all species to keep a simple alphabet.
The issue is with unicode language (japanese kana, arabic, ...)
2cd point is that our hash function must return 20 bytes.
usually hashcode must be very different when 2 words are closes.
But I do not like this idea here because if we have a file with name "small dog" (always the same ;)) we have to warn all distributed hash table that we have a file for : "small", "dog", "sma", smal".(remember we cut the word) the issue is that we have to warn a 4 nodes. So best is to have closier node to avoid a lot of packet on network...
Distance will be the same as kademlia DHT: XOR.
Acknowledgements
================
Thanks to Alan Mac Govern for his help.
References
==========
.. [#BEP-5] BEP_0005. DHT Protocol, Loewenstern
(http://www.bittorrent.org/beps/bep_0005.html)
Copyright
=========
This document has been placed in the public domain.
..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:
|