Decentralised Network File Sharing
#11
(Apr 17, 2018, 03:46 am)Hiroven Wrote: About collision-free algorithms: SHA-1 was considered good for a long time, and yet a collision has already been found.

Interesting. But notice the computing power and time needed; if it's that hard to find a collision, it should be just as hard to avoid one.
As I see it, there's still a long way to go. I suppose a non-random number sequence (like a Fourier series) could be used, with deterministic patterns "filling in the blanks" where collisions occur.
If CDs and DVDs can use interleaving techniques to compress data, it should be possible - but maybe not with hashing alone.

(Apr 17, 2018, 05:29 am)Mr.Masami Wrote:
(Apr 17, 2018, 05:16 am)Kingfish Wrote: Probably a bit more feasible over the BitTorrent protocol, since each file is split into pieces, so you would just be recreating each piece.

Yes, it could be just a slight modification of some open-source BitTorrent client to do this simple yet intensive task.
Would this make sense? idk, this is a theory thread, so there's my ;)

@edit: wrote a little asm program that takes any file, generates an MD5 hash for every 32-byte piece and stores them all in a binary file; it's now running on my server trying to brute-force the first piece back.
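In Python the same idea boils down to something like this (the real thing is in asm; the file handling and names here are just placeholders):

Code:
import hashlib
import sys

PIECE_SIZE = 32  # bytes per piece, matching the experiment above

def dump_piece_hashes(src_path, out_path):
    """Write the 16-byte MD5 digest of every PIECE_SIZE-byte chunk of src_path."""
    with open(src_path, "rb") as src, open(out_path, "wb") as out:
        while True:
            piece = src.read(PIECE_SIZE)
            if not piece:
                break
            out.write(hashlib.md5(piece).digest())

if __name__ == "__main__":
    dump_piece_hashes(sys.argv[1], sys.argv[2])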

Any theory is welcome. I'm learning as I go.

This brings to mind my teenage attempt at the lottery - I knew it would be pure luck, but a friend wanted to try.
I put an Apple II with 64 KB to work over the weekend generating combinations, sent him a floppy disk and said: "Now you mark all those tickets and pay for them." :P
Obviously he didn't have that kind of money, nor was such a bet viable; covering every number always costs more than the prize!

I'm curious to see the results - just hope I live long enough...
#12
According to my poor math skills, a 32-byte block has 2^256 combinations, which equals 115792089237316195423570985008687907853269984665640564039457584007913129639936 possible values.
So the hashing algorithm needs to chew through 2^256 * 32 = 3705346855594118253554271520278013051304639509300498049262642688253220148477952 bytes, which equals 3369993333393829974333376885877453834204643052817571560137951281152 TiB of data.
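Spelling that out in Python (plain big-integer arithmetic, nothing assumed beyond the 32-byte piece size):

Code:
# Spell out the magnitudes with plain integer arithmetic.
PIECE_SIZE = 32                        # bytes per piece
combinations = 2 ** (PIECE_SIZE * 8)   # 2^256 possible 32-byte blocks

total_bytes = combinations * PIECE_SIZE
total_tib = total_bytes // 2 ** 40     # 1 TiB = 2^40 bytes

print(combinations)   # 1157920892...639936
print(total_bytes)    # 3705346855...477952
print(total_tib)      # 3369993333...281152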

Piece of cake.

[Image: e2SzWCP.png]

@edit: Tested with 2-byte pieces; reconstruction works 1:1 and fairly fast, but of course there's no point in using pieces smaller than the hash - some special/optimized hashing algorithm just for this purpose would be needed.
Later I will try different piece sizes and maybe write a special hashing function for it; it doesn't need to be sophisticated - actually, the simpler the better.
All we need is a good ratio between hash size and piece size, with fairly good speed.
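Roughly what the 2-byte test looks like in Python, just to show the brute-force loop (the piece size and the MD5 choice are only for illustration):

Code:
import hashlib

PIECE_SIZE = 2  # small enough that brute force over 65,536 candidates is instant

def recover_piece(digest):
    """Return the first PIECE_SIZE-byte candidate whose MD5 matches the digest."""
    for n in range(256 ** PIECE_SIZE):
        candidate = n.to_bytes(PIECE_SIZE, "big")
        if hashlib.md5(candidate).digest() == digest:
            return candidate
    return None

original = b"Hi"
print(recover_piece(hashlib.md5(original).digest()))  # b'Hi'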

In the end it's probably a waste of time, but I just like to experiment with stuff :D
#13
I salute your healthy curiosity and sporting spirit!

I found this: http://burtleburtle.net/bob/crypto/magnitude.html
Quote:

2^256 = Possible 256-bit (32-byte) keys.
2^268 = Instructions executed by a nanocomputer the size of the sun running for a million years.
Calculated by 2^190 atoms / 2^20 atoms/processor * 10^16 instructions/second * 3600*24*365.25*1,000,000.
---

So, taking those figures at face value: one sun-sized nanocomputer does about 2^268 instructions per million years, i.e. roughly 2^223 per second. To sweep all 2^256 candidate 32-byte pieces in one second you would therefore need billions of them - around 8 billion even at N = 1, where N is the instruction count of the hashing algorithm. Run on a single machine, the same sweep takes a couple of hundred N years. So not millions of sun-sized supernanocomputers, but not hundreds either.
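Plugging the quoted figures straight into Python to sanity-check that (the processor count and instruction rate are the page's assumptions; N = 1 is a deliberately silly best case):

Code:
# Figures quoted from burtleburtle.net above.
processors  = 2 ** 190 // 2 ** 20      # atoms in the sun / atoms per processor
instr_per_s = processors * 10 ** 16    # instructions per second for one machine

candidates = 2 ** 256                  # possible 32-byte pieces
N = 1                                  # instructions per hash attempt

ratio = candidates * N // instr_per_s  # machines needed for a one-second sweep...
print(ratio)                           # ~7.7 billion
print(ratio / (3600 * 24 * 365.25))    # ...or ~245 years with just one machine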

Current byte nomenclature can't handle it; vundabytes are 2^33 (11 orders of magnitude).
#14
Calculations done - now all we need is a 1.21-gigawatt power source :D
#15
lol, guess not even the NSA can afford it.
#16
(Apr 18, 2018, 07:03 am)Mr.Masami Wrote: According to my poor math skills, a 32-byte block has 2^256 combinations... In the end it's probably a waste of time, but I just like to experiment with stuff :D
You guys are my heroes =D

No, I BELIEVE this is the FUTURE of file compression. With FPGAs and other specialized processor types appearing, I believe this is the future of the field: arbitrarily leaving out some data to be recalculated later.

I am working on something I call Hierarchical CRC, which just stores the CRCs for specific lengths of data within the source file(s).

The lengths will be stored in the header, so no guessing is needed.

I have two lengths of data: a sector (4 bytes long) and a cluster (8 bytes long).

So I store the checksums for both lengths of data.

And you end up with around twice as many sector checksums as cluster checksums, although you could interleave these, i.e. jump only 4 bytes forward per cluster checksum.

Then, when I weave back together the sectors I bulk-reconstructed earlier (*1), all I have to do is walk the possible sector combinations for each cluster (*2) to weed out collisions; on top of that we use a variable-length checksum, one that grows more complex as the data complexity grows.

I believe the most difficult task would be getting the system to weed out the majority of collisions, so that anything else that appeared to match would be an obvious error on opening.
Collisions are OK as long as they are managed.

What do ya all think?

*1: why not generate the sectors only once (storing them on disk for later use), using the list of sector CRCs as the matching tool.
*2: by checksumming the raw data for two adjacent sectors you can verify them against the cluster CRCs.
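To make that concrete, here's a scaled-down Python sketch - sectors shrunk to 2 bytes and clusters to 4 so the brute force runs in a blink, and a single MD5 byte standing in for whatever checksum ends up being used; all names are just for illustration:

Code:
import hashlib
from itertools import product

SECTOR = 2   # bytes (scaled down from 4 for the demo)
CLUSTER = 4  # bytes (scaled down from 8)

def check8(data):
    """Toy 8-bit checksum (one MD5 byte) standing in for the real CRC."""
    return hashlib.md5(data).digest()[0]

def encode(data):
    """Store one checksum per sector and one per cluster (non-interwoven)."""
    sector_sums = [check8(data[i:i + SECTOR]) for i in range(0, len(data), SECTOR)]
    cluster_sums = [check8(data[i:i + CLUSTER]) for i in range(0, len(data), CLUSTER)]
    return sector_sums, cluster_sums

def sector_candidates(sector_sum):
    """Bulk-reconstruct every sector matching a stored checksum (step *1)."""
    return [bytes(c) for c in product(range(256), repeat=SECTOR)
            if check8(bytes(c)) == sector_sum]

def decode_cluster(sector_sums, cluster_sum):
    """Weave sector candidates together and weed collisions via the cluster checksum (step *2)."""
    left, right = (sector_candidates(s) for s in sector_sums)
    return [a + b for a in left for b in right if check8(a + b) == cluster_sum]

data = b"DATA"                             # one cluster = two sectors
sector_sums, cluster_sums = encode(data)
survivors = decode_cluster(sector_sums, cluster_sums[0])
print(len(survivors), data in survivors)   # hundreds of collisions survive, original included

One caveat on the stand-in: as far as I can tell, a plain CRC at the cluster level is essentially determined by the two sector CRCs (CRCs are linear), so it would weed out nothing; a checksum that mixes non-linearly, like the hash byte above, is what lets the second pass cut the candidate pairs down to a few hundred.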
#17
We aren't geniuses but so far the task is too complex for current knowledge and technology; it is theoretically possible but takes too much time or resources.
How did your compression model fare in tests?
#18
(Nov 26, 2018, 09:13 am)dueda Wrote: We aren't geniuses but so far the task is too complex for current knowledge and technology; it is theoretically possible but takes too much time or resources.
How did your compression model fare in tests?

It hasn't been tested yet - it's very much a little work in progress - but the compression factor is pretty arbitrary; if you go with an 8-bit CRC per sector and an 8-bit CRC per cluster (non-interwoven), that's 3 bytes of checksums per 8 bytes of data, so better than 50% compression.


I propose to limit the search length to 256*256*256*256 = 4,294,967,296 combinations of 4 bytes, or about 16,384 MiB of candidate data - fairly reduced CPU requirements, don't you think, or do I have my head up my ass somewhere?

I know! I'll take some meth and see if I can finish the encoder tonight - that's the easy part. Maybe then I can start work on the fun part, the decoder. (Popeye the Sailor Man plays quietly in the background)

If I do finish the task, it will be for single files only, and it will be assumed that you have already applied deduplication-based compression.

Oh, and just to totally phone the whole project in, I'm gonna use the nice easy language of VB .Net.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

So I still haven't finished the encoder, and I can't answer a basic question: how much should the final checksum grow? Like 1 extra byte per megabyte?

I suppose I should ask: does anyone know at what file sizes a hash or checksum algorithm starts suffering from too many collisions? For MD5 and the like?
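As a rough yardstick (not a real answer), the birthday bound says an n-bit hash gives you about even odds of at least one colliding pair after roughly 1.18 * 2^(n/2) random pieces. In Python, for CRC32- and MD5-sized digests over 32-byte pieces (piece size just matches the earlier posts):

Code:
import math

def pieces_for_coin_flip_collision(hash_bits):
    """Approximate piece count for a ~50% chance of some collision (birthday bound)."""
    return 1.1774 * math.sqrt(2 ** hash_bits)

PIECE_SIZE = 32  # bytes, matching the earlier experiments
for name, bits in [("CRC32", 32), ("MD5", 128)]:
    pieces = pieces_for_coin_flip_collision(bits)
    print(f"{name}: ~{pieces:.3g} pieces, ~{pieces * PIECE_SIZE:.3g} bytes hashed")

Note that this only says when two pieces of the same file start sharing a digest; reconstructing a piece from its digest runs into the much harsher problem that any piece longer than the hash has astronomically many preimages.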