Tune OpenWayback ZipNum Cluster Searches

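ZipNum speeds up CDX lookups by storing the sorted index as a series of gzip-compressed blocks and keeping a small summary file that records the first key of each block. A search consults the summary first and decompresses only the blocks that can contain matching records.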
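
As a rough illustration of that lookup, here is a minimal Python sketch. It is not OpenWayback's implementation. It assumes each summary line is tab-separated as key, part name, offset, length; the sample lines and the helper names parse_summary and find_block are hypothetical.

```python
import bisect

# Hypothetical summary lines (assumed layout: key, part name, offset, length).
SAMPLE_SUMMARY = [
    "com,apple)/ 20100101000000\tpart-00000\t0\t1000",
    "com,example)/ 20120101000000\tpart-00000\t1000\t1200",
    "org,archive)/ 20110101000000\tpart-00001\t0\t900",
]


def parse_summary(lines):
    """Parse summary lines into (key, part, offset, length) tuples."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, part, offset, length = line.split("\t")[:4]
        entries.append((key, part, int(offset), int(length)))
    return entries


def find_block(entries, search_key):
    """Return the first block that may hold records starting with search_key.

    Each summary key is the first record of its block, so matching records
    can begin in the block just before the first summary key >= search_key.
    """
    keys = [entry[0] for entry in entries]
    index = bisect.bisect_left(keys, search_key) - 1
    return entries[max(index, 0)]


entries = parse_summary(SAMPLE_SUMMARY)
key, part, offset, length = find_block(entries, "com,example)/")
print(part, offset, length)
```

A real reader would then seek to that offset in the named part, decompress the block, and keep scanning following blocks until the keys no longer match.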

Configuration

<property name="source">
  <bean class="org.archive.wayback.resourceindex.ZipNumClusterSearchResultSource">
    <property name="cluster">
      <bean class="org.archive.format.gzip.zipnum.ZipNumCluster">
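        <!-- Summary index: first key of each compressed block, with its part name, offset and length -->
        <property name="summaryFile" value="/data/cdx/summary.txt" />
        <!-- Location file: maps each part name to the paths or URLs it can be read from -->
        <property name="locFile" value="/data/cdx/loc.txt" />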
      </bean>
    </property>
  </bean>
</property>

Generate the summary and loc files with the Hadoop job described in the docs, then point summaryFile and locFile at the results.

Diagram

  flowchart LR
    A[ZipNum summary] --> B[ZipNumCluster]
    C[ZipNum loc] --> B
    B --> D[OpenWayback search]

After applying the configuration, watch catalina.out for ZipNum errors to confirm that every location listed in the loc file is reachable.
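
Reachability can also be checked before OpenWayback is started. The Python sketch below is one way to do it, under stated assumptions: each loc line is taken to be tab-separated, with a part name followed by one or more locations, and each location is either a local path or an http(s) URL. The function names are hypothetical and not part of OpenWayback.

```python
import os
import urllib.error
import urllib.request


def parse_loc(lines):
    """Yield (part, [locations]) from loc lines (assumed tab-separated)."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or incomplete lines
        yield fields[0], fields[1:]


def is_reachable(location, timeout=10):
    """Check a location: HEAD request for http(s) URLs, existence for paths."""
    if location.startswith(("http://", "https://")):
        request = urllib.request.Request(location, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return 200 <= response.status < 400
        except (urllib.error.URLError, OSError):
            return False
    return os.path.exists(location)


def unreachable_parts(lines, probe=is_reachable):
    """Return {part: [failed locations]} for every part with a failure."""
    bad = {}
    for part, locations in parse_loc(lines):
        failed = [loc for loc in locations if not probe(loc)]
        if failed:
            bad[part] = failed
    return bad
```

To use it against the file configured above, open /data/cdx/loc.txt, pass the file object to unreachable_parts, and review any parts it reports. An empty result means every listed location answered, which does not by itself prove the blocks are intact.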