MonoTorrent Forum Index
RegisterSearchFAQMemberlistUsergroupsfChatLog in
Reply to topic Page 1 of 1
distributed search engine
Author Message
Reply with quote
Post distributed search engine 
Hi,

what is a missing features in bittorrent is search features.

E-mule have it.
We have some solution with rss from some website/tracker but it is not perfect.

So what is the goal of a p2p search engine: looking for a file with some keyword with less impact on network.
To reduce that we need to find the quicker way to find node who know who have the file.
It is the same issue than DHT. The issue is that DHT have a unique id wheras here we can have few word to describe a single file.

My idea is made a hashcode from each word and distribute it over a second hashtable with dht.
We will need some new packets.

2 issues with that system:
- we search with a boolean OR not a boolean AND as user expect to have. Anyway we can find a solution for that with a a list of word to filter in search packet.
- no approximation

Example
"small dog"
Send a search to DHT for "small" and for "dog".
we find all about small and all words about dog.
We have to search only in lower-case, numbers and some other car=> restricted caractères list.
The second issue is that we will not found "smallest dogs" for example....

else we will have to do minimal search on 3 characters and we do substring to spread keyword which can be a subpart of the original keyword.

This is just an idea to extend the DHT.

Feal free to post feed back.
I will try to write a document about it

View user's profile Send private message
Reply with quote
Post  
we have to cut it in category too.

we can do it with extetion:
zip rar deb rpm, ... =>data
mov, avi, mpeg, divx ..=> videos
mp3, wma, ogg vorbis ..=> audios

Issue, ogg can be audio or video ;(((

anyway we can filter on extention with category too ;)

View user's profile Send private message
Reply with quote
Post  
hashcode must be done from the search word.
This hashcode must not been the .NET hashcode.
issue: what special character of some language:
germany : ß
France: é è à ç ô ö...

I think we must avoid alls species to keep a simple alphabet.
The issue is with unicode language (japanese kana, arabic, ...)

2cd point is that our hash function must return 20 bytes.
usually hashcode must be very diffrent when 2 words are closes.
But I do not like this idea here because if we have a file with name "small dog" (always the same ;)) we have to warn all distributed hash table that we have a file for : "small", "dog", "sma", smal".(remember we cut the word) the issue is that we have to warn a 4 nodes. So best is to have closier node to avoid a lot of packet on network...

Distance will be the same as kad : XOR.

View user's profile Send private message
Reply with quote
Post  
here is the proposal BEP for bittorrent.org.
I need feedback:
BEP: XXX
Title: Distributed Search Engine
Version: $Revision$
Last-Modified: $Date$
Author: Olivier Dufour <olivier(dot)duff(at)gmail(dot)com>
Status: Draft
Type: Standards Track
Content-Type: text/x-rst
Created: 02-May-2009
Post-History:


Introduction
============

The goal of this BEP is to provide a search engine in the bittorrent client.
We have ever this in big tracker site. And some client use it with rss but it is far from perfect.


Use the DHT
===========

In order to provide a distributed search engine we must use the current DHT which the only network not link to torrent hash.
So this proposal will be done into the DHT.
DHT have a unique id whereas here we can have few word to describe a single file.

My idea is to make an hashcode from each word and distribute it over a second hashtable with dht. (The first one come from BEP 5)
We will need some new packets.

we will have to restrict the minimal search on 3 characters and we do substring to spread keyword which can be a subpart of the original keyword.
//TODO use token?

Broadcast File Owning
=====================
"q" == "own_file" //TODO or announce_file
- "keyword" = keyword hashcode
- "name" = name of file
- "infohash" = infohash
- "category"(optional) = Video, Audio, Image, or Data

send "get_peers" request to get closer node until find the closer node to hashcode of keyword
then send the broadcast of own_file.
//maybe have to use another message than get_peers new message???
When receiving this message the node have to store in an hashtable with key equal to keyword hashcode and value equal
to a list of structure which contain name, infohash, and category.
This structure can evolve for need if some optional data is add the own_file request.

Search File
===========

"q" == "search_file"
A search File query have x arguments:
- "search" = keyword hashcode
- "filter" = filter (other keyword of the search sentence)
- "category"(optional) = Video, Audio, Image, or Data

if the node who receive this packet do not have a dictionnary
over the search word it respond with closer note ( with xor distance) to hash of keyword.
Else it return the search response:
"q" == "search_responce"
"dict" = dictionnary of name + infohash + category

the dictionary is build with the list of value found for the key in hashtable filter with category and other keywords of search sentence.

Category
========

we have to cut it in category too.

we can do it with extension:
zip rar deb rpm, ... =>data
mov, avi, mpeg, divx ..=> videos
mp3, wma, ogg vorbis ..=> audios
TODO : Image
Issue, ogg can be audio or video ;(((

anyway we can filter on extention with category too ;)

HashCode
========

hashcode must be done from the search word.
This hashcode must not been the .NET hashcode.
issue: what special character of some language:
germany : ß
France: é è à ç ô ö...

I think we must avoid all species to keep a simple alphabet.
The issue is with unicode language (japanese kana, arabic, ...)

2cd point is that our hash function must return 20 bytes.
usually hashcode must be very different when 2 words are closes.
But I do not like this idea here because if we have a file with name "small dog" (always the same ;)) we have to warn all distributed hash table that we have a file for : "small", "dog", "sma", smal".(remember we cut the word) the issue is that we have to warn a 4 nodes. So best is to have closier node to avoid a lot of packet on network...

Distance will be the same as kademlia DHT: XOR.

Acknowledgements
================

Thanks to Alan Mac Govern for his help.

References
==========

.. [#BEP-5] BEP_0005. DHT Protocol, Loewenstern
(http://www.bittorrent.org/beps/bep_0005.html)

Copyright
=========

This document has been placed in the public domain.



..
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
coding: utf-8
End:

View user's profile Send private message
Display posts from previous:
Reply to topic Page 1 of 1


You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum