Decoding the Dictionary: The Mechanics of Torrent Files and Hash Values
#1

Written by me in October 2014; originally posted as a KAT tutorial.


Below, I will investigate the intricacies of torrent metadata, and exactly how the crucial info_hash identifier is derived from it. A thorough understanding of these details allows us to directly examine, and, subject to the exercise of a lot of caution, even to manually edit, .torrent files. The tutorial is intended as a resource for anyone about to tackle those challenges, as well as for those who are just idly curious. I shall do my best to make the contents broadly accessible; the only prerequisite should be a certain minimal standard of technical literacy.





0) Concepts


"Metadata" refers to "data about data", or, more concretely, "data that describes other data". For example, if we consider the preceding sentence as data, an associated item of metadata could be the number of words it contains (14). Torrent files are filled with such metadata, describing the original content based on which they were created and whose transfer they are intended to facilitate.

This data is stored in a hierarchical manner, using the "Bencode" format, which is even simpler than, and almost as transparent as, the bbCode used in this forum. It employs the following three basic concepts of information theory.

Delimited sequences: One of the ways to store data of variable length is to define a pair of flags marking its start and end points. For this to work, those flags cannot occur within the data itself. bbCode tags, for example, are based on this - the string [b]blah[/b] consists of a start tag, followed by an unspecified number of characters, followed by an end tag.

Value pairs: Associating two values can result in elevating mere data to the level of information; which is to say that, in a very real sense, the meaning contained in such a combination exceeds the sum of the meanings of its parts. Consider, for instance, the word "age" and the number "42", neither of which really tells us much on their own - and then consider the combination "Age: 42".

Lists of items: By contrast, if we have a bunch of data of the same type, such as "apple", "orange", and "plum", we may need a structure that can contain the lot without losing track of where one ends and the next begins - "appleorangeplum" is clearly problematic.


1) Bencode specifications


The format used in .torrent files combines the concepts outlined above in various ways to produce the following four basic structures (described here in order of complexity). Any linebreaks and indentations occurring in the examples below and hereafter are present only for purposes of legibility, and would be absent in any actual use of the format.

Delimited integers: Numbers are stored as text - which has both advantages and disadvantages, compared to storing them in binary form - and are prefixed with an "i" for "integer" and suffixed with an "e" for "end".

--i1e
--i987e

Tallied strings: The same approach does not work for text and binary data, because whichever characters we choose as delimiters, there is no way to ensure that they don't appear as part of the data. Instead, these are stored as a value pair, consisting of the string itself and a number declaring its length. The number goes first, followed by a colon, followed by the string; this ordering removes any potential for ambiguity, even in cases in which further numbers and colons appear as part of the text.

--4:blah
--6:Age:42

Externally delimited lists: The list structure is exceedingly simple, consisting in nothing more than the sequence of items, prefixed with an "l" for "list" and suffixed again with an "e". Unlike in the earlier "appleorangeplum" example, no internal delimiters (such as commas) are required, because unlike ordinary words, each item must be a Bencoded entity in its turn.

--l
----5:apple
----6:orange
----4:plum
--e
--
--l
----i1e
----4:blah
----l
------3:age
------i987e
------5:apple
----e
--e

Externally delimited ordered dictionaries: The final and most powerful construct is what is referred to as a "dictionary". In common usage, the word describes what boils down to a list of entries, each of which consists of a term and a definition or description of that term. The IT jargon sense is essentially the same, only a bit more general: A list of value pairs, each consisting of a string called the "key" and data associated with that key. Furthermore, and again just as is generally the case for ordinary dictionaries, the items must be listed such that the keys are in ascending order (cf lexicographical order @ Wikipedia for specifics). As with simple lists, no additional internal delimiters are necessary because each constituent is a self-contained entity.

--d
----6:fruits
------l
--------5:apple
--------6:orange
--------4:plum
------e
----7:numbers
------l
--------i1e
--------i987e
------e
--e
--
--d
----3:Age
------i42e
----4:Name
------d
--------5:First
----------7:Freddie
--------4:Last
----------5:Femur
------e
--e
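
To make the four structures concrete, here is a minimal Python sketch of an encoder. It is my own illustration rather than code lifted from any actual client: the function name "bencode" is an arbitrary choice, and treating text as UTF-8 is an assumption on my part.

```python
def bencode(value):
    """Encode an int, str/bytes, list or dict as Bencode bytes (illustrative sketch)."""
    if isinstance(value, bool):
        raise TypeError("Bencode has no boolean type")
    if isinstance(value, int):
        # Delimited integer: i<digits>e
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")  # assumption: text is stored as UTF-8
    if isinstance(value, bytes):
        # Tallied string: <length>:<data>
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        # List: l<items>e, no separators needed between the items
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Dictionary: d<key><value>...e, with the keys in ascending order
        pairs = [(k.encode("utf-8") if isinstance(k, str) else k, v)
                 for k, v in value.items()]
        pairs.sort(key=lambda pair: pair[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


print(bencode(987))                          # b'i987e'
print(bencode("Age:42"))                     # b'6:Age:42'
print(bencode(["apple", "orange", "plum"]))  # b'l5:apple6:orange4:plume'
print(bencode({"numbers": [1, 987], "fruits": ["apple", "orange", "plum"]}))
# b'd6:fruitsl5:apple6:orange4:plume7:numbersli1ei987eee'
```

Note that the last call passes "numbers" first, yet "fruits" comes out first: the encoder, not the caller, is responsible for the key ordering.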



2) Torrent file specifications


This now gives us the toolkit to pick up where we started off: The, for lack of a better term, physical contents of any .torrent file are nothing more or less than the Bencoded metadata associated with what we generally think of as the "contents" of the "torrent". The usual means of displaying .torrent files, such as websites like this one and BitTorrent clients like the one you're using, hide the former and show us the latter. To see the contents naked, as it were, we have to open the file using software which doesn't know or care about how the BitTorrent protocol works - a hex editor, for example.
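
A hex editor shows the raw bytes; a few lines of Python can go one step further and turn them back into nested structures. The following is a minimal decoder sketch, assuming well-formed input and doing no validation beyond the bare minimum. The name "bdecode" is my own, and strings are left as raw bytes because a field like "pieces" is not text.

```python
def bdecode(data, pos=0):
    """Decode one Bencoded value starting at data[pos]; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():
        # Tallied string: read the length, then exactly that many bytes
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {pos}")


sample = b"d3:Agei42e4:Named5:First7:Freddie4:Last5:Femuree"
value, end = bdecode(sample)
print(value)               # {b'Age': 42, b'Name': {b'First': b'Freddie', b'Last': b'Femur'}}
print(end == len(sample))  # True
```

To apply this to a real file, read it in binary mode (open(path, "rb"), with whatever path your .torrent file has) and pass the resulting bytes to bdecode.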

The bulk of the metadata is just what common sense would suggest: A description of the files/the data to be torrented, and a tracker listing. For principally historical reasons, the specific substructure of the corresponding blocks differs somewhat depending on whether there is a single file or tracker, or several of either, so we're going to look at examples of both cases. For illustration purposes, our "original data" will consist of two tiny text files, "abba.txt" and "abc.txt" (backup archive), the contents of which match their names: The "abba" file is exactly 64 kB in size and contains nothing but the character "a" (plus some line breaks) in its first and last quarters, and nothing but "b" in its two middle quarters. Along the same lines, "abc" is 48 kB in size and consists of three equally-sized portions, filled with, you guessed it, "a", "b", and "c". The reasoning behind those particular choices will become clear in due course.

First example (single-file torrent with a single tracker)

Using the current mainline client (BitTorrentPlus 7.9.2), I now create a torrent from the "abba" file, adding one tracker and a short comment and setting the piece size to 16 kB. The resultant .torrent file is 324 Bytes in length and has the following contents, formatted for legibility as before.

--d
----8:announce
------38:udp://tracker.publicbt.com:80/announce
----7:comment
------30:This is a single-file torrent.
----10:created by
------16:BitTorrent/7.9.2
----13:creation date
------i1413650210e
----8:encoding
------5:UTF-8
----4:info
------d
--------6:length
----------i65536e
--------4:name
----------8:abba.txt
--------12:piece length
----------i16384e
--------6:pieces
----------80:
------------0x1A 0xD6 0xF6 0x4C 0x8D 0x94 0xFA 0x2E 0x20 0x54 0xD3 0xF6 0xE0 0x1A 0xB7 0x2A 0xE3 0x34 0xF2 0xD9
------------0x13 0xF7 0xEB 0x29 0x20 0x01 0x54 0x6E 0x42 0x9D 0x0C 0x7F 0x81 0x27 0xCD 0xD2 0xB8 0x39 0x0D 0x85
------------0x13 0xF7 0xEB 0x29 0x20 0x01 0x54 0x6E 0x42 0x9D 0x0C 0x7F 0x81 0x27 0xCD 0xD2 0xB8 0x39 0x0D 0x85
------------0x1A 0xD6 0xF6 0x4C 0x8D 0x94 0xFA 0x2E 0x20 0x54 0xD3 0xF6 0xE0 0x1A 0xB7 0x2A 0xE3 0x34 0xF2 0xD9
------e
--e

Starting from the top (both in the linear and the hierarchical sense), the whole thing is a dictionary with a handful of entries, most of which are simple value pairs, while the last one is another dictionary called "info". The first key is "announce", and the value makes clear that this refers to the lone tracker; then comes the comment I added; then a creator signature; a creation timestamp ("in standard UNIX epoch format (integer, seconds since 1-Jan-1970 00:00:00 UTC)", according to the design document); a charset designation for the text-based portions; and finally the second, subordinate dictionary... which is where things get interesting!

The first of the entries in the "info" dictionary, "length", equates to 64k, so it must refer to the size of the original file - as well as that of the torrent in its entirety, as it contains nothing but said file. Then comes the (file-) "name", the "piece length" equating to 16k, and last but not least a field called "pieces" which contains an 80-byte data block, displayed here in standard hexadecimal notation (each "0x##" snippet corresponds to a single byte). As it turns out and as shown above, this portion is more usefully considered as a series of 20-byte blocks, that being the length of a "SHA1"-type hash value, one of which is derived from, and can later be checked against, each of the (64k/16k=) four pieces. Whenever you instruct your torrent client to perform a "force re-check" on a partially completed download, for example, this is what your local copy is "re-checked" against to determine whether each piece is identical to the corresponding one in the original copy (which is to say, complete) or not (incomplete).

And that's where the precise partitioning of our "abba" file pays off: The first and last of the four pieces are identical, as are the two middle pieces - and as a direct result, so are their hashes!
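
The piecewise hashing is easy to reproduce with Python's standard hashlib module. The sketch below uses a stand-in for the "abba" file made of pure "a" and "b" runs, without the line breaks of the real file, so the digests it produces will not match the ones listed above; the pattern of repeats is what matters. The helper name "piece_hashes" is my own.

```python
import hashlib

PIECE_LENGTH = 16 * 1024

# Stand-in for abba.txt: one quarter "a", two quarters "b", one quarter "a".
# Assumption: no line breaks, unlike the real file.
data = b"a" * PIECE_LENGTH + b"b" * (2 * PIECE_LENGTH) + b"a" * PIECE_LENGTH


def piece_hashes(data, piece_length):
    """Return the 20-byte SHA-1 digest of each consecutive piece."""
    return [hashlib.sha1(data[i:i + piece_length]).digest()
            for i in range(0, len(data), piece_length)]


hashes = piece_hashes(data, PIECE_LENGTH)
pieces = b"".join(hashes)  # this concatenation is what the "pieces" field holds

print(len(hashes), len(pieces))                        # 4 80
print(hashes[0] == hashes[3], hashes[1] == hashes[2])  # True True
print(hashes[0] == hashes[1])                          # False
```

A re-check then amounts to hashing each local piece the same way and comparing it against the corresponding 20-byte slice of "pieces".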

Which leaves us with a series of endings - first the "e" flag for the inner "info" dictionary structure, ditto that for the outer wrapper, and then the end of the file. (And of this example.)

Second example (multi-file torrent with multiple trackers)

Including multiple files and trackers produces a heftier and more deeply nested .torrent file. To demonstrate, I'm creating a second torrent from a directory named "txt" containing both text files and using more trackers. I'm also doubling the piece size, mainly to see what, if anything, changes relative to the first example when the total size (64 kB + 48 kB) is no longer evenly divisible by that figure, as would of course be the case in the vast majority of real-world scenarios. The resultant file is 535 Bytes long, and breaks down like so:

--d
----8:announce
------44:udp://tracker.openbittorrent.com:80/announce
----13:announce-list
------l
--------l
----------44:udp://tracker.openbittorrent.com:80/announce
--------e
--------l
----------35:udp://tracker.istole.it:80/announce
--------e
--------l
----------36:udp://open.demonii.com:1337/announce
--------e
------e
----7:comment
------29:This is a multi-file torrent.
----10:created by
------16:BitTorrent/7.9.2
----13:creation date
------i1413650291e
----8:encoding
------5:UTF-8
----4:info
------d
--------5:files
----------l
------------d
--------------6:length
----------------i65536e
--------------4:path
----------------l
------------------8:abba.txt
----------------e
------------e
------------d
--------------6:length
----------------i49152e
--------------4:path
----------------l
------------------7:abc.txt
----------------e
------------e
----------e
--------4:name
----------3:txt
--------12:piece length
----------i32768e
--------6:pieces
----------80:
------------0x88 0x60 0x14 0x9F 0xAC 0x5F 0x81 0x66 0xB9 0x68 0x39 0xB8 0x1C 0x8B 0x67 0x36 0xDC 0x86 0x53 0x93
------------0xF1 0xBE 0xBE 0x4B 0x24 0x51 0x47 0x92 0x67 0x68 0x03 0x67 0x1E 0x3C 0x4A 0x44 0x2D 0xE5 0x2F 0x5B
------------0x88 0x60 0x14 0x9F 0xAC 0x5F 0x81 0x66 0xB9 0x68 0x39 0xB8 0x1C 0x8B 0x67 0x36 0xDC 0x86 0x53 0x93
------------0x35 0x70 0xD6 0x98 0xFE 0xF6 0x2D 0x8D 0x2A 0xC4 0x92 0x87 0xCA 0x61 0x0D 0xD7 0x3E 0xBB 0xE7 0xF1
------e
--e

As before, the metadata is wrapped into a single outer dictionary structure. Also as before, the first entry ("announce") refers to a single tracker. From what I gather, this field ends up being ignored entirely, but is still required to be present for compatibility with the original protocol specification (which, in other words, should have been designed with more flexibility to avoid this redundancy). Instead, the torrent will use the trackers listed in the following entry ("announce-list"). In fact, this is not just a list of trackers but a list of lists of trackers, and this added layer of complexity allows for the arrangement of trackers in so-called "tiers" which determine the order in and priority with which they are accessed - in reality, this is rarely taken advantage of, though, at least in my limited experience.
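
In code, the list-of-lists shape looks like this. The sketch only walks the structure and falls back to the lone "announce" entry when no "announce-list" is present; how a real client orders or retries trackers within a tier is not modelled here, and the helper name "tracker_tiers" is my own.

```python
meta = {
    "announce": "udp://tracker.openbittorrent.com:80/announce",
    "announce-list": [
        ["udp://tracker.openbittorrent.com:80/announce"],  # tier 1
        ["udp://tracker.istole.it:80/announce"],           # tier 2
        ["udp://open.demonii.com:1337/announce"],          # tier 3
    ],
}


def tracker_tiers(meta):
    """Return the tracker tiers, falling back to the single "announce" entry."""
    tiers = meta.get("announce-list")
    if tiers:
        return [list(tier) for tier in tiers]
    return [[meta["announce"]]] if "announce" in meta else []


for number, tier in enumerate(tracker_tiers(meta), start=1):
    for url in tier:
        print(number, url)
```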

The next four fields are no different from their counterparts in the first example - comment, creator, timestamp, charset.

Which once more brings us to an inner "info" block structured as a second dictionary. Instead of the simple "length" value pair we found in the single-file example, though, its first entry, labelled "files", in turn contains a plethora of internal structure. On reflection, this is a clear necessity, too, because this is what maps to the internal structure of the torrent's contents. Step by step, that appears to work like so:
  • "files" keys to a list of
    • dictionaries, one for each file included in the torrent, which contain a
      • "length" value pair, giving the file's size, and a
      • "path", which keys to the file's path relative to the torrent's top-level folder, represented as a list of
        • folder and file names.
Strictly speaking, this is more depth than is really required: Each file has exactly one length and one path, so one could do away with the extra layer of dictionaries and simply use a long list of alternating lengths and paths, with the understanding that the first and second list item refer to the first file, the third and the fourth item to the second file, and so on. Similarly, the path could be stored as a single string, relying on the usual internal delimiters (slashes) between folder and file names. That being said, the chosen implementation has conceptual and practical advantages, and due to the compactness of the Bencode format with its single-character flags, the overhead is negligible. Note that the total size of the torrent is not stated here or anywhere else, which implies that for a web site or torrent application to be able to display it, they first have to derive it by summing over the individual files.
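
The summing, and the reassembly of the paths, can be sketched as follows. The dictionary mirrors the decoded "info" block of the second example with the "pieces" field left out; the helper names are my own, and joining with "/" is just a display convention I picked.

```python
info = {
    "files": [
        {"length": 65536, "path": ["abba.txt"]},
        {"length": 49152, "path": ["abc.txt"]},
    ],
    "name": "txt",
    "piece length": 32768,
}


def total_size(info):
    """Sum the file lengths (multi-file) or return "length" (single-file)."""
    if "files" in info:
        return sum(entry["length"] for entry in info["files"])
    return info["length"]


def file_paths(info):
    """Rebuild display paths from the top-level name and the path lists."""
    if "files" in info:
        return ["/".join([info["name"], *entry["path"]]) for entry in info["files"]]
    return [info["name"]]


print(total_size(info))  # 114688
print(file_paths(info))  # ['txt/abba.txt', 'txt/abc.txt']
```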

The second entry is the "name", just as before, though it now refers to the name of the top-level folder instead of that of the lone file itself. Third is "piece length", exactly as before, except for my arbitrary switch to 32 kB.

And that brings us once more to "pieces", which also works just the same as before: An 80-byte block, composed of four 20-byte hashes. Note that the first and third of these coincide, which is of course no coincidence (pun intended), as the first piece corresponds to the "ab" portion of the "abba" file and the third to the "ab" portion of the "abc" file. To answer the earlier question, nothing about the final piece shows that it differs from the other three by being really only "half a piece" (corresponding to the final 16 kB of the "abc" file).
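
The same experiment can be replayed in Python by treating the two files as one continuous stream and cutting it into 32 kB pieces. As before, the stand-in files below contain no line breaks (an assumption), so the digests differ from the listed ones while the pattern stays the same.

```python
import hashlib

KB = 1024
PIECE_LENGTH = 32 * KB

# Stand-ins for the two files, in the order they appear in "files".
abba = b"a" * (16 * KB) + b"b" * (32 * KB) + b"a" * (16 * KB)
abc = b"a" * (16 * KB) + b"b" * (16 * KB) + b"c" * (16 * KB)

stream = abba + abc  # pieces ignore file boundaries
chunks = [stream[i:i + PIECE_LENGTH] for i in range(0, len(stream), PIECE_LENGTH)]
hashes = [hashlib.sha1(chunk).digest() for chunk in chunks]

print([len(chunk) for chunk in chunks])  # [32768, 32768, 32768, 16384]
print(hashes[0] == hashes[2])            # True
print(len(b"".join(hashes)))             # 80
```

The short final piece still yields a full 20-byte hash, which is why the "pieces" field alone cannot reveal it; a client has to work out its length from the file sizes and the piece length.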

... end of dictionary, end of dictionary, end of file, end of example. End of section, even!


3) Hash value derivations


So far, so good, but where does the info_hash that identifies the torrent as a whole tie into all of that? Clearly, just like the total size in the second example, it's not directly part of the metadata stored in the .torrent file, so it has to be derived from some or all of its parts in some way. As it turns out, and as its name directly tells us (in hindsight, anyway), the process by which this is done isn't just equivalent to, but is fully identical to, the one that derives the piecewise hashes from the original data: The info_hash is derived directly from the whole of, and from nothing but, the "info" portion of the .torrent file (starting from and ending with and including the "d" and "e" delimiter flags of the inner dictionary structure), just as if it were regular data. Which, it could be argued, makes this hash an instance of "meta-metadata"!

This can be readily demonstrated by simply using our trusty hex editor to strip everything that's not part of that "info" portion from a given .torrent file, saving the result as a new file, creating yet another torrent from this new file, and then (h)examining that .torrent file in turn. Note that for this to work, all of the data has to be hashed into a single value, meaning that the piece size of the secondary torrent has to exceed the size of the remnant of the primary .torrent file (which, in this case, is a given anyway).

For the two examples, the info_hashes as reported by the client are 1BFF97884CB71F9D25FFCA63AAC2F117AD48431A and 77FB2B740728B4A5E81C508BEB2B954356F9B1A8 respectively, while the procedure outlined above correspondingly yields (this time formatted somewhat differently, in the interests of brevity and emphasis)

--d8:announce29:udp://tracker.publicbt.com:80
--7:comment4:Blah
--10:created by16:BitTorrent/7.9.2
--13:creation datei1413664354e
--8:encoding5:UTF-8
--4:infod
----6:lengthi146e
----4:name21:abba.torrent.info.dat
----12:piece lengthi16384e
----6:pieces
------20:1BFF97884CB71F9D25FFCA63AAC2F117AD48431A
--ee

and

--d8:announce29:udp://tracker.publicbt.com:80
--7:comment4:Blah
--10:created by16:BitTorrent/7.9.2
--13:creation datei1413664368e
--8:encoding5:UTF-8
--4:infod
----6:lengthi204e
----4:name20:txt.torrent.info.dat
----12:piece lengthi16777216e
----6:pieces
------20:77FB2B740728B4A5E81C508BEB2B954356F9B1A8
--ee

respectively (note the vastly different piece lengths, just for the sake of experimentation). Woot! :D
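
The torrent-of-a-torrent detour can also be short-circuited by hashing the "info" portion directly. The sketch below is self-contained and therefore brings its own small encoder; it uses 80 zero bytes as a placeholder for "pieces", so the resulting hash is not that of the actual example torrent. Searching for the bytes "4:info" is a naive shortcut that works here only because that sequence occurs nowhere else in the sample.

```python
import hashlib


def bencode(value):
    """Compact Bencode encoder (illustrative sketch; str is encoded as UTF-8)."""
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        pairs = sorted(value.items(), key=lambda pair: pair[0])  # str keys assumed
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


info = {
    "length": 65536,
    "name": "abba.txt",
    "piece length": 16384,
    "pieces": bytes(80),  # placeholder: zero bytes instead of real piece hashes
}
torrent = {
    "announce": "udp://tracker.publicbt.com:80/announce",
    "comment": "This is a single-file torrent.",
    "info": info,
}

# Route 1: encode the inner dictionary on its own and hash it.
encoded_info = bencode(info)
info_hash = hashlib.sha1(encoded_info).hexdigest().upper()

# Route 2: the hex editor approach, cutting the inner dictionary out of the file.
raw = bencode(torrent)
start = raw.index(b"4:info") + len(b"4:info")
stripped = raw[start:-1]  # drop the outer dictionary's closing "e"

print(len(encoded_info))         # 146
print(stripped == encoded_info)  # True
print(hashlib.sha1(stripped).hexdigest().upper() == info_hash)  # True
```

When working from a real .torrent file, the second route is the safer one, since it hashes the bytes exactly as stored instead of relying on a re-encoding that might differ from the original.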


4) Practical implications


Now, what does the preceding section signify when it comes to the manual editing of .torrent files? Principally, that - short of messing up the syntax, which will cause the client to simply reject the torrent (at best) or to crash (at worst) - you should be able to freely modify the contents all the way up to the "info" marker to your heart's content. Changing anything inside the "info" dictionary, on the other hand, will affect the hash, which is to say, you'll in effect no longer be modifying the old torrent but creating a new one entirely.

Concretely, that means that you can easily alter the creation timestamp and slightly less easily (due to having to keep track of the character tallies) alter the comment and creator signature fields. The same applies to exchanging one tracker for another, while the wholesale removal of trackers, or the addition of new ones, is trickier yet, considering that that'd involve modifications of higher-level list constructs (mind you, the latter may look reasonably straightforward in this tutorial, but is bound to be very fiddly in a non-hierarchical hex editor view). By contrast, switching to a different charset is easy in and of itself - but the effects of doing so can all too easily turn out disastrous, I'd expect.

Beyond that, it should also be possible to add entirely new entries to the outer dictionary; in fact, that is exactly how most of the later protocol extensions (like webseeds) and various client-specific features are implemented. Unless you do intend to write a BitTorrent application of your own which would have a use for such additional fields, that seems on the pointless side of things, though. If you do have such lofty intentions, do keep in mind that dictionaries have to be ordered by key, so you have to put the new entries in their proper places.
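
If hand-editing in a hex editor sounds too fiddly, the same edits can be made on decoded structures and then re-encoded, which takes care of both the character tallies and the key order. The sketch below is self-contained, uses placeholder piece hashes, and adds a "url-list" (webseed) entry with a made-up URL purely as an example of a new outer key.

```python
import hashlib


def bencode(value):
    """Compact Bencode encoder (illustrative sketch; str is encoded as UTF-8)."""
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        pairs = sorted(value.items(), key=lambda pair: pair[0])  # str keys assumed
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def info_hash(torrent):
    return hashlib.sha1(bencode(torrent["info"])).hexdigest().upper()


torrent = {
    "announce": "udp://tracker.publicbt.com:80/announce",
    "comment": "Blah",
    "creation date": 1413650210,
    "info": {"length": 65536, "name": "abba.txt",
             "piece length": 16384, "pieces": bytes(80)},  # placeholder hashes
}
before = info_hash(torrent)

# Edits outside "info": the tallies and the ordering are fixed up on re-encoding.
new_comment = "An edited, longer comment."
torrent["comment"] = new_comment
torrent["creation date"] = 1413650211
torrent["url-list"] = ["http://example.com/abba.txt"]  # hypothetical webseed URL
edited = bencode(torrent)
after_outer = info_hash(torrent)

# An edit inside "info": this is now a different torrent.
torrent["info"]["name"] = "abba2.txt"
after_inner = info_hash(torrent)

print(before == after_outer)  # True
print(before == after_inner)  # False
```

Note that "url-list" sorts after "info" and so ends up behind the inner dictionary in the file, yet the hash is unaffected: what matters is whether an edit falls inside the "info" dictionary, not where in the file it happens to land.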


As always, I expressly invite intelligent questions and constructive criticism, as I've no doubt overlooked a few things while investigating this, and blundered a few times writing it up, so if you think you may have spotted such instances, please do not hesitate to point them out!




Attached Files
.txt   abba.txt (Size: 64 KB / Downloads: 1,253)
.txt   abc.txt (Size: 48 KB / Downloads: 1,250)
#2
I remember that tut from Kat. It was (and is) the best explanation of a torrent file I've ever seen. I salute you.

A couple of points I would add:

1. One key you haven't mentioned is the "private" key, found within the "info" section in .torrent files uploaded to/downloaded from private torrent sites and sometimes found in .torrent files downloaded from public sites when people have misguidedly uploaded private torrents there. On detecting its presence (with a value of 1) a torrent client will automatically forego using DHT or Peer Exchange to locate peers for that torrent, relying only on the trackers enclosed within it. And its removal (bearing in mind it exists within the "info" section) will change the torrent hash so the torrent client will be able to use DHT and Peer Exchange but won't be able to connect to the original swarm.

2. The keys you have covered, "announce", "announce-list", "comment" etc. are specifically defined within the BitTorrent protocol--so ALL torrent clients will recognize and use them. But torrent clients simply and deliberately ignore any keys they do not recognize, which means that torrent client developers can and do create custom keys that only their clients use. Torrents created by uTorrent in particular contain a considerable amount of extraneous data.

I've attached two tools useful for examining .torrent files

TorrentSpy - is a "torrent aware" tool which displays the contents in an indented view similar to that you have used above.
BEncode Editor - is a general tool for editing bencoded files. e.g. it automatically handles the recalculating of string lengths if you edit any of the text fields etc.



PS. Re. your "temporary notes for forum staff" mentioning other sites is fine, and your markup is also fine.


Attached Files
.exe   TorrentSpy-0.2.4.26.exe (Size: 478.5 KB / Downloads: 0)
.exe   BEncode Editor.exe (Size: 254.91 KB / Downloads: 0)
#3
(Aug 13, 2017, 17:43 pm)Sid Wrote: I remember that tut from Kat. It was (and is) the best explanation of a torrent file I've ever seen. I salute you.

*blushes*

(Aug 13, 2017, 17:43 pm)Sid Wrote: One key you haven't mentioned is the "private" key, found within the "info" section in .torrent files uploaded to/downloaded from private torrent sites and sometimes found in .torrent files downloaded from public sites when people have misguidedly uploaded private torrents there.

Yep, I actually wondered about that too when re-reading this today. "Private" is not part of the baseline specs (presumably because at the time, no alternatives to traditional trackers were envisioned, so disallowing such alternatives would have been redundant), which may be why I didn't include it. But then neither are multiple trackers, and I evidently did include that. There may have been a reason for those choices - but if so, I no longer recall it. Or it may have been entirely arbitrary. Either way, it's worth a closer look.

Do you happen to know what this forum's per-post character limit is, if there is one? At both KAT and another site I re-posted this to today it was 20k, which wasn't quite enough for this writeup to fit into a single post, unlike here. Just wondering how much "breathing room" that leaves me. ;)

