Archiving revisions
Archiving revisions is a crucial aspect of maintaining a healthy and efficient MediaWiki installation, especially for wikis experiencing substantial content creation and editing activity. This article provides a comprehensive guide for beginners on understanding and implementing revision archiving, explaining the benefits, methods, and configuration options available in MediaWiki 1.40 and later.
What are Revisions and Why Archive Them?
Every time a page is saved in a MediaWiki wiki, a new *revision* is created. This revision represents a snapshot of the page's content at that specific moment in time. These revisions are stored in the wiki's database, allowing users to view past versions, compare changes, and revert to earlier states if necessary. While this functionality is invaluable, it also means the database can grow rapidly, particularly on high-traffic wikis.
Storing every revision indefinitely can lead to several problems:
- **Database Bloat:** The database size increases significantly, potentially slowing down the wiki's performance. Queries take longer, page loading times increase, and overall responsiveness suffers.
- **Increased Backup Times:** Larger databases require longer backup and restore times, increasing the risk of data loss.
- **Higher Server Costs:** More storage space translates to higher server costs.
- **Performance Degradation:** Searching through a vast history can become inefficient, impacting features like change comparisons.
Archiving revisions addresses these issues by selectively moving older revisions out of the active database while still preserving them for future access, albeit in a less readily accessible form. It is a trade-off between retaining historical data and keeping the wiki responsive, and it mirrors the data lifecycle management principles used in other database systems: older data is usually less relevant for day-to-day operations and can live in a slower, cheaper tier.
Understanding Revision Archiving Methods
MediaWiki offers several methods for archiving revisions. The best approach depends on your wiki’s size, traffic, and specific needs.
- **Automatic Archiving (Recommended):** This is the most common and generally recommended method. It utilizes a background process (a job queue) to automatically archive revisions based on configurable criteria, such as age and the number of revisions already stored. This is achieved through the `$wgArchiveMethod` configuration variable and related settings.
- **Manual Archiving:** This involves using a maintenance script to archive revisions directly. It offers more control but requires manual intervention and is less suitable for ongoing maintenance. It’s often used for one-time cleanups or for wikis with very specific archiving requirements.
- **External Archiving Tools:** Some third-party tools can assist with revision archiving, offering features beyond those provided by MediaWiki itself. These are less common and typically require more technical expertise to set up and maintain. They often leverage the MediaWiki API and data export/import processes; a minimal sketch of this approach follows below.
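For illustration only, here is a minimal sketch of how an external tool might pull a page's revision history through MediaWiki's standard action API before storing it elsewhere. The wiki URL, page title, and output file are placeholders; the query parameters (`prop=revisions`, `rvprop`, `rvslots`, `rvlimit`) are the stock revision-query options.

```bash
# Fetch the 50 most recent revisions of a page, including content, as JSON
# that an external archiving tool could store elsewhere.
# Replace https://example.org/w and Main_Page with your own values.
curl -G -s "https://example.org/w/api.php" \
  --data-urlencode "action=query" \
  --data-urlencode "titles=Main_Page" \
  --data-urlencode "prop=revisions" \
  --data-urlencode "rvprop=ids|timestamp|user|comment|content" \
  --data-urlencode "rvslots=main" \
  --data-urlencode "rvlimit=50" \
  --data-urlencode "format=json" \
  > Main_Page-revisions.json
```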
Configuring Automatic Archiving
Automatic archiving is configured through the `LocalSettings.php` file. Here’s a breakdown of the key configuration variables:
- `$wgArchiveMethod = 'default';` : Specifies the archiving method. `'default'` utilizes the standard archiving process. Other options may exist depending on installed extensions.
- `$wgArchiveDepth = 500;` : Determines the maximum number of revisions to keep for each page. Revisions exceeding this limit will be considered for archiving. This is a critical parameter; adjust it based on your wiki's content and editing frequency. Pages that receive frequent, substantial edits may warrant a higher depth.
- `$wgArchiveAge = '20240101';` : Specifies the cut-off date (in YYYYMMDD format) for revisions to be considered for archiving. Revisions older than this date become targets for archiving, which lets you archive revisions in batches by age. Looking at how quickly revisions have accumulated historically can help you choose a suitable cut-off.
- `$wgArchiveExtension = 'archive';` : Defines the table where archived revisions will be stored. The default is 'archive'.
- `$wgArchivePrefix = 'Archived revisions of';` : Sets the prefix for the archive page name.
- `$wgArchiveScriptPath = '/w/index.php';` : Sets the path to the MediaWiki script for accessing archived revisions.
**Example Configuration:**
```php
$wgArchiveMethod     = 'default';
$wgArchiveDepth      = 200;
$wgArchiveAge        = '20230101';
$wgArchiveExtension  = 'archive';
$wgArchivePrefix     = 'Archived revisions of';
$wgArchiveScriptPath = '/w/index.php';
```
With this configuration, revisions dated before January 1, 2023 become candidates for archiving, and at most the 200 most recent revisions are kept in the active tables for each page.
How Automatic Archiving Works
The automatic archiving process runs as a scheduled job (typically via cron). It iterates through all pages in the wiki, identifying revisions that meet the criteria defined by `$wgArchiveDepth` and `$wgArchiveAge`. When a revision is selected for archiving:
1. **Data Transfer:** The revision's content and metadata are moved from the `revision` table to the `archive` table.
2. **Revision Table Update:** The `revision` table is updated to remove the archived revision, leaving a marker indicating that the revision has been archived and its location in the `archive` table.
3. **Link Preservation:** The history page still displays all revisions, including archived ones. Clicking on an archived revision redirects the user to a special page that retrieves the content from the `archive` table.
It's important to note that the archiving process is *asynchronous*: it does not happen immediately when a revision meets the archiving criteria. Instead, the work is queued and processed in the background, minimizing the impact on wiki performance. Understanding this asynchronous behaviour is key to troubleshooting archiving delays; if archiving consistently lags behind, review how often the job queue runs and how many jobs each run is allowed to process.
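If your wiki relies on cron rather than web requests to process the job queue, an entry along the following lines keeps archiving (and other queued) jobs moving. `maintenance/runJobs.php` is the standard MediaWiki job runner; the schedule, user, wiki path, and `--maxjobs` value shown here are assumptions to adapt to your server.

```bash
# /etc/cron.d style entry (includes the user column).
# Process up to 300 queued jobs every 15 minutes.
*/15 * * * * www-data php /var/www/mediawiki/maintenance/runJobs.php --maxjobs 300 > /dev/null 2>&1
```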
Accessing Archived Revisions
Accessing archived revisions is transparent to users. When viewing a page's history, archived revisions are displayed alongside current revisions. Clicking on an archived revision will redirect you to a special page that displays the archived content. The URL will typically include the archive table name and the revision ID. The archived revisions are still searchable, although the search performance might be slightly slower than searching through the active revision table.
Manual Archiving: Using Maintenance Scripts
While automatic archiving is preferred, manual archiving can be useful in specific scenarios. MediaWiki provides a maintenance script called `archive.php` for this purpose.
**Usage:**
```bash
php maintenance/archive.php
```
This script will archive revisions based on the same criteria defined in the `LocalSettings.php` file (`$wgArchiveDepth` and `$wgArchiveAge`). You can also specify additional options, such as limiting the number of pages processed or archiving revisions for a specific namespace. Refer to the MediaWiki documentation for a complete list of available options.
**Caution:** Manual archiving can be resource-intensive, especially on large wikis. It's recommended to run it during off-peak hours to minimize the impact on wiki performance.
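One way to soften that impact on a Linux host is to run the script at reduced CPU and I/O priority. `nice` and `ionice` are standard utilities; the wiki path and log location below are placeholders.

```bash
# Run the archive script at low CPU and I/O priority during off-peak hours,
# keeping a log so the run can be reviewed afterwards.
cd /var/www/mediawiki
nice -n 19 ionice -c 3 php maintenance/archive.php > /var/log/mediawiki-archive.log 2>&1
```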
Monitoring and Troubleshooting
- **Job Queue:** Regularly monitor the job queue to ensure that the archiving process is keeping up. You can inspect the queue from the command line with the `showJobs.php` maintenance script or via the site statistics exposed by the API (`action=query&meta=siteinfo&siprop=statistics`).
- **Error Logs:** Check the MediaWiki error logs for any errors related to archiving.
- **Database Size:** Monitor the database size to track the effectiveness of archiving.
- **Performance:** Monitor wiki performance (page loading times, query execution times) to identify any issues introduced by archiving. A performance monitoring dashboard can be invaluable here.
- **Archive Table:** Verify that the `archive` table is growing as expected.
- **Cron Configuration:** Ensure that the cron job for running the archiving process is configured correctly and running on schedule.
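A minimal monitoring sketch, assuming a MySQL/MariaDB backend and the default table names: it reports the number of pending jobs and the size of the `revision` and `archive` tables. The database name and credentials are placeholders, and wikis using a table prefix (`$wgDBprefix`) will need to adjust the table names.

```bash
cd /var/www/mediawiki

# Number of jobs still waiting in the queue (archiving jobs included).
php maintenance/showJobs.php

# Approximate on-disk size and row count of the active and archive tables.
mysql -u wikiuser -p wikidb -e "
  SELECT table_name,
         ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb,
         table_rows
  FROM information_schema.tables
  WHERE table_schema = 'wikidb'
    AND table_name IN ('revision', 'archive');"
```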
Best Practices for Revision Archiving
- **Start with Conservative Settings:** Begin with a relatively high `$wgArchiveDepth` and a distant `$wgArchiveAge`. Gradually reduce the depth and age as you gain confidence.
- **Regular Monitoring:** Continuously monitor the archiving process and adjust the settings as needed.
- **Backup Before Archiving:** Always create a full database backup before performing any archiving operation, especially manual archiving; a minimal example follows this list.
- **Consider Namespace-Specific Settings:** You might want to use different archiving settings for different namespaces (e.g., archive revisions in the main namespace more frequently than in user talk pages). This can be achieved using extensions or custom code.
- **Understand the Implications:** Thoroughly understand the implications of archiving before implementing it. Ensure that you have a reliable backup strategy in place.
- **Test in a Staging Environment:** Before applying any changes to your production wiki, test them in a staging environment.
- **Document Your Configuration:** Keep detailed documentation of your archiving configuration, including the settings used and the reasons for those settings.
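As a point of reference for the backup recommendation above, a minimal dump of a MySQL/MariaDB-backed wiki with `mysqldump` might look like the following. The credentials, database name, and output path are placeholders, and wikis on PostgreSQL or SQLite need the equivalent tool for their backend.

```bash
# Full, compressed database backup taken before an archiving run.
# --single-transaction avoids locking InnoDB tables during the dump.
mysqldump -u wikiuser -p --single-transaction wikidb | gzip > /backups/wikidb-$(date +%F).sql.gz
```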
Advanced Considerations
- **Extension Support:** Several extensions can enhance revision archiving capabilities, offering features such as more granular control, advanced reporting, and integration with external archiving services.
- **Customization:** For highly specific requirements, you can customize the archiving process by modifying the MediaWiki core code or writing custom extensions.
- **Data Retention Policies:** Develop a clear data retention policy that outlines how long revisions will be archived and when they will be permanently deleted. This policy should align with your organization's legal and regulatory requirements; a data governance framework can help ensure compliance.
- **Database Optimization:** After archiving, consider running database optimization tasks (e.g., `ANALYZE TABLE`) to improve performance; a short sketch follows this list.
- **Archiving Strategies:** Explore different archiving strategies, such as archiving revisions based on the number of views or the number of edits. Reviewing how quickly revisions accumulate over time can help identify sensible thresholds.
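A short sketch of the post-archiving optimization step mentioned above, assuming MySQL/MariaDB and the default table names; `ANALYZE TABLE` and `OPTIMIZE TABLE` are standard SQL statements, and the database name is a placeholder.

```bash
# Refresh index statistics and reclaim space after a large archiving run.
# OPTIMIZE TABLE rebuilds the table and can lock it briefly; run it off-peak.
mysql -u wikiuser -p wikidb -e "
  ANALYZE TABLE revision, archive;
  OPTIMIZE TABLE revision, archive;"
```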
Related Concepts
- Database administration
- MediaWiki configuration
- Page history
- Diffs
- Maintenance scripts
- Cron jobs
- Database backups
- Performance optimization
- Special pages
- Extensions
Useful Resources
- MediaWiki Manual: [1](https://www.mediawiki.org/wiki/Manual:Configuration_settings/Variable_reference#Archiving)
- MediaWiki Archive Script Documentation: [2](https://www.mediawiki.org/wiki/Manual:archive.php)