Configuring ZipNumCluster
See Aaron Swartz’s post on ZipNum and CDX cluster merging for background.
Enable the ZipNumClusterSearchResultSource inside CDXCollection.xml:
<property name="resourceIndex">
  <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
    <property name="canonicalizer" ref="waybackCanonicalizer" />
    <property name="source">
      <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
        <property name="cluster">
          <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
            <property name="summaryFile" value="/<PATH-TO-SUMMARYFILE>" />
            <property name="locFile" value="/<PATH-TO-LOCFILE>" />
          </bean>
        </property>
        <property name="params">
          <bean class="org.archive.format.gzip.zipnum.ZipNumParams" />
        </property>
      </bean>
    </property>
    <property name="maxRecords" value="100000" />
    <property name="dedupeRecords" value="true" />
  </bean>
</property>
Summary file format
Tab-separated columns:
- First capture line in the chunk
- Chunk (shard) name
- Byte offset where the chunk starts
- Chunk length in bytes
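To make the column layout concrete, here is a minimal Python sketch that parses summary lines and picks the chunk that could hold a given key. It is not part of Wayback: the names `SummaryEntry`, `parse_summary_line`, and `find_chunk` are hypothetical, the sample lines are invented, and the lookup assumes the summary file is sorted by its first column.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class SummaryEntry(NamedTuple):
    first_line: str  # first capture line in the chunk
    shard: str       # chunk (shard) name
    offset: int      # byte offset where the chunk starts
    length: int      # chunk length in bytes


def parse_summary_line(line: str) -> SummaryEntry:
    """Split one tab-separated summary line into its four columns."""
    first_line, shard, offset, length = line.rstrip("\n").split("\t")
    return SummaryEntry(first_line, shard, int(offset), int(length))


def find_chunk(entries: List[SummaryEntry], key: str) -> SummaryEntry:
    """Return the last entry whose first line sorts at or before key.

    Assumes entries are sorted by first_line (an assumption of this sketch).
    """
    keys = [e.first_line for e in entries]
    i = bisect_right(keys, key) - 1
    return entries[max(i, 0)]


# Invented sample data, for illustration only.
SAMPLE_SUMMARY = (
    "com,example)/ 20130101000000\tpart-00000\t0\t1024\n"
    "com,example)/page 20130201000000\tpart-00000\t1024\t2048\n"
    "org,archive)/ 20130301000000\tpart-00001\t0\t512\n"
)

entries = [parse_summary_line(l) for l in SAMPLE_SUMMARY.splitlines()]
hit = find_chunk(entries, "com,example)/about 20130115000000")
print(hit.shard, hit.offset, hit.length)
```

The offset and length identify the byte range to read from the named shard, so a reader only needs to fetch and decompress that one chunk.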
Loc file format
Tab-separated columns:
- Chunk (shard) name
- Chunk URL, e.g. hdfs://... or http://...
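As an illustration of the loc file layout, the following Python sketch reads it into a mapping from shard name to URL. The function name `parse_loc_file` is hypothetical, the hosts and paths in the sample are invented, and the sketch assumes exactly the two columns listed above.

```python
from typing import Dict


def parse_loc_file(text: str) -> Dict[str, str]:
    """Map each chunk (shard) name to its URL, one tab-separated pair per line."""
    locations: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        shard, url = line.split("\t", 1)
        locations[shard] = url.strip()
    return locations


# Invented sample data, for illustration only.
SAMPLE_LOC = (
    "part-00000\thttp://cdx.example.org/cluster/part-00000.gz\n"
    "part-00001\thdfs://namenode.example.org/cluster/part-00001.gz\n"
)

locs = parse_loc_file(SAMPLE_LOC)
print(locs["part-00000"])
```

The shard name is the join key between the two files: a summary entry names the shard, and the loc file says where that shard can be fetched.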
Generate summary and loc files with the Hadoop tools linked in the article above.