
new history compression class
Closed, DeclinedPublic

Assigned To
Authored By
bzimport
Jun 3 2005, 3:22 PM
Referenced Files
F2113: SpecialImport.php.diff
Nov 21 2014, 8:35 PM
F2112: convertDump
Nov 21 2014, 8:35 PM
F2111: export-0.2.xsd
Nov 21 2014, 8:35 PM
F2110: dumpBackup.php.diff
Nov 21 2014, 8:35 PM
F2108: SpecialExport.php.diff
Nov 21 2014, 8:35 PM
F2107: HistoryBlob.php.diff
Nov 21 2014, 8:35 PM
F2106: historyblobtest.php
Nov 21 2014, 8:35 PM
Description

Author: elwp

Description:
This class (SplitMergeGzipHistoryBlob) compresses large pages much better than
the old class ConcatenatedGzipHistoryBlob, and is reasonably fast. The attached
program historyblobtest.php includes some speed tests. To use it, export a
page with its complete history and call 'php historyblobtest.php pagename.xml'.

Unlike ConcatenatedGzipHistoryBlob, SplitMergeGzipHistoryBlob does not use
serialization. So to create and save an object, use

$obj = new SplitMergeGzipHistoryBlob( $compressedBlob );

and

$compressedBlob = $obj->getCompressedBlob();
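
For illustration, a fuller usage sketch (addItem()/getItem() are assumed here
to follow the HistoryBlob interface of ConcatenatedGzipHistoryBlob; only the
constructor, removeItem() and getCompressedBlob() are confirmed in this task):

<?php
require_once 'HistoryBlob.php';

// Start from an existing compressed blob, e.g. loaded from the text table.
$obj = new SplitMergeGzipHistoryBlob( $compressedBlob );

// Assumed interface: store a revision text and keep the returned hash as key.
$hash = $obj->addItem( $revisionText );

// Assumed interface: fetch the revision text back by its hash.
$text = $obj->getItem( $hash );

// Confirmed above: re-serialize the blob without PHP serialization.
$compressedBlob = $obj->getCompressedBlob();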

Three states are defined for SplitMergeGzipHistoryBlobs: SM_COMPRESSED,
SM_READONLY (uncompressed, but sections and indices not yet extracted)
and SM_READWRITE (completely converted into arrays). The intermediate
SM_READONLY state exists because extracting all sections would be too much
overhead when only a single revision is requested. The layout of the flat
uncompressed data used in state SM_READONLY is described in
http://meta.wikimedia.org/wiki/User:El/History_compression/Blob_layout
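
A rough sketch of how that laziness could look (the state constants are from
the description above; the class, method and helper names are invented for
illustration and are not the patch's actual code):

<?php
define( 'SM_COMPRESSED', 0 ); // raw deflated blob, as stored in the database
define( 'SM_READONLY',   1 ); // inflated flat data; sections not yet parsed
define( 'SM_READWRITE',  2 ); // sections and indices extracted into arrays

class LazyBlobSketch {
	var $mState = SM_COMPRESSED;
	var $mData;

	function LazyBlobSketch( $compressedBlob ) {
		$this->mData = $compressedBlob;
	}

	function getItem( $hash ) {
		// Reading one revision only needs the flat layout described at
		// the URL above, so decompress but do not parse everything.
		if ( $this->mState == SM_COMPRESSED ) {
			$this->mData = gzinflate( $this->mData );
			$this->mState = SM_READONLY;
		}
		return $this->scanFlatData( $hash );
	}

	function addItem( $text ) {
		// Writing requires the fully extracted arrays.
		if ( $this->mState != SM_READWRITE ) {
			$this->extractSections();
			$this->mState = SM_READWRITE;
		}
		// ... insert $text into the section arrays ...
	}

	function scanFlatData( $hash ) { /* walk the flat layout */ }
	function extractSections() { /* build the section and index arrays */ }
}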


Version: 1.5.x
Severity: enhancement
OS: Linux
Platform: PC

Details

Reference
bz2310

Event Timeline

bzimport raised the priority of this task to Lowest. Nov 21 2014, 8:35 PM
bzimport set Reference to bz2310.

elwp wrote:

HistoryBlob.php

attachment HistoryBlob.php ignored as obsolete

elwp wrote:

test program

Attached:

it all looks cool, but... what happens in case of hash collision?

elwp wrote:

If two texts have the same hash, the second text is not stored in the
history blob. But this is the same behaviour as in ConcatenatedGzipHistoryBlob.
Someone very clever might manage to compose a different text with the same
hash, but no one would notice: it would just look like a normal reversion.
Hash collisions between random texts are very unlikely.
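
A small runnable sketch of the shared pattern (md5 keying as in
ConcatenatedGzipHistoryBlob; the function is illustrative, not the real
addItem()):

<?php
function addItemSketch( &$items, $text ) {
	$hash = md5( $text );
	if ( !isset( $items[$hash] ) ) {
		$items[$hash] = $text; // the first text with this hash wins
	}
	// A colliding second text is silently dropped; a reader asking for its
	// hash gets the first text back, which looks like a normal reversion.
	return $hash;
}

$items = array();
$h1 = addItemSketch( $items, "some revision text" );
$h2 = addItemSketch( $items, "some revision text" ); // same text, same hash
var_dump( $h1 === $h2 );     // bool(true)
var_dump( count( $items ) ); // int(1) -- deduplicated, like a reversion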

elwp wrote:

corrected version of HistoryBlob.php

I forgot to set mMetaData in removeItem().

attachment HistoryBlob.php ignored as obsolete

elwp wrote:

HistoryBlob.php.diff

Attached:

elwp wrote:

SpecialExport.php.diff

Attached:

elwp wrote:

dumpBackup.php.diff

Changes depend on changes in HistoryBlob.php and SpecialExport.php.

Attached:

elwp wrote:

export-0.2.xsd

I don't know if this is a correct XML schema.

Attached:

elwp wrote:

convertDump (perl script)

This is a demo program that converts dumps generated with the
--splitrevisions and --usebackrefs options back to the old format. It is
very slow, and the documents it generates differ somewhat from those
produced by dumpBackup.php (more whitespace and reordered attributes).

Attached:

jeluf wrote:

Is there also an import script ready, for importing the dumps back into MySQL?

elwp wrote:

In http://mail.wikipedia.org/pipermail/wikitech-l/2005-May/029298.html Brion wrote:

I still need to finish up an importer script using the Special:Import
framework.

I haven't found such a script in CVS yet. When he's finished, I'll adapt the
script and SpecialImport.php to the new format (provided my code is accepted).

No comments yet, haven't had time to follow this in detail, but
please keep it up -- more efficient compression ordering would be
*real nice* to have for 1.6 (or if it's a clean integration, perhaps
a merge to 1.5).

Definitely on the 1.6 roadmap.

elwp wrote:

SpecialImport.php.diff

MAX_FILE_SIZE = 2000000 is too small for uncompressed page histories, so I
also added the option to upload gzipped XML files. But now the size of the
uncompressed data may exceed memory_limit. It's probably a bad idea to hold
the complete page history in memory. Hmm...

Attached:

Can you make this a unified diff?

Instead of inflating the file in memory, it would probably be better to read it from a stream -- there should already be classes for doing that for importDump.php, I think.
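
For reference, a minimal sketch of that stream-based approach using PHP's
standard zlib functions (the file name is hypothetical):

<?php
// Read the gzipped upload incrementally instead of inflating it all at once.
$fh = gzopen( 'pagehistory.xml.gz', 'rb' ); // hypothetical file name
if ( !$fh ) {
	die( "Could not open dump file\n" );
}
while ( !gzeof( $fh ) ) {
	$line = gzgets( $fh, 65536 );
	// ... feed $line to an incremental XML parser such as xml_parse() ...
}
gzclose( $fh );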

Another old patch that's gotten recent comments. :)

This seems to combine a new history compression class with some sort of changes to the XML export format, I think mainly to allow marking identical revisions.

Assigning to Tim for the blob stuff, if there's anything in here we want to adapt to current stuff.

It's very likely that the xdiff-based solution I implemented will outperform this one. But if you disagree, feel free to run some benchmarks to compare them.