Lessons on Polarion SVN Importer.
On and off over the last few years or so I have been working with the open-source Polarion SVN Importer tool to help customers migrate to Subversion (SVN) from legacy version control systems (aka VCSs) like CVS, Serena PVCS, Borland StarTeam, IBM ClearCase and MKS Integrity.
In this and future posts, I’m going to share my experience with this tool. While the tool has its limitations, on balance its strengths make it worth your time if you find yourself searching for a so-called “full history” migration from one of these other VCSs to SVN (more on the dreaded “full history” term a little later). Furthermore, you could also use this tool as the first step in migrating from a legacy VCS to Git, as the well-regarded svn2git tool provides high-quality conversion between SVN and Git.
At any rate, for right now I’m simply going to go over the framework and model that SVN Importer uses, and then touch on the high-level features and benefits, as well as some of the limitations and gotchas. Expositions on migrations from specific VCSs will (hopefully) follow in later posts. I’m also working on getting some of the updates I’ve made to the SVN Importer code base published to GitHub, so stay tuned for news on that as well.
How the Polarion SVN Importer works
To start, it helps to understand the challenges that we face in trying to convert from these legacy VCSs to SVN. All of the older systems mentioned above handle source revisions at the file level, as opposed to SVN, which records revisions on the repository as a whole. SVN’s style allows multiple file updates to be consolidated into a single repository revision, sometimes referred to as a changeset. The older style of defining independent file revisions for each file changed in a commit goes all the way back to RCS, and over time that model has shown its shortcomings. For one, commits with these systems are generally not atomic or transaction-based, and so the systems have potential integrity issues. In the usability department, I think most users agree that having the notion of changesets in your VCS is preferable to tracking individual file revisions (though some of the above systems do have some changeset capabilities).
Beyond functionality, this fundamental difference in models between these systems and SVN presents some challenges for how to translate the file revisions from these systems into SVN repository revisions. For some legacy repositories (e.g. CVS and PVCS), there is no other signaling metadata that can help the tool group these file revisions into SVN repository revisions, and so each file revision is brought over as a unique repository revision. Other systems have various types of signaling metadata, like change requests, that can be used to group otherwise disparate file revisions into single repository revisions. In general, every source VCS provider has its own model in the software, which is populated from requests to the source VCS and then transformed into the SVN model in order to generate an SVN-compatible dump file.
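To make the grouping idea concrete, here is a minimal Python sketch. It is not SVN Importer's actual code (the tool is written in Java and each provider has its own model); the `FileRevision` class, its fields, and the `group_into_commits` function are names I made up for illustration. It assumes a change request ID is the only grouping signal available, and it orders commits by the timestamp of their latest file revision, which is just one possible choice.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class FileRevision:
    """One file-level revision as a legacy VCS might report it (hypothetical model)."""
    path: str
    number: str                           # file-level revision label, e.g. "1.4"
    author: str
    timestamp: int                        # seconds since the epoch
    change_request: Optional[str] = None  # grouping metadata, if the VCS has any


def group_into_commits(revisions: List[FileRevision]) -> List[List[FileRevision]]:
    """Group file revisions into repository-level commits.

    Revisions that share a change request become one commit. Revisions
    without a change request each become a commit of their own, which is
    what happens when the source VCS offers no grouping metadata at all.
    """
    groups: Dict[Tuple[str, object], List[FileRevision]] = {}
    ordered = sorted(revisions, key=lambda r: r.timestamp)
    for index, rev in enumerate(ordered):
        if rev.change_request:
            key = ("cr", rev.change_request)
        else:
            key = ("single", index)  # unique key, so the revision stays alone
        groups.setdefault(key, []).append(rev)
    # Order commits by the timestamp of their latest file revision.
    return sorted(groups.values(), key=lambda g: g[-1].timestamp)


revisions = [
    FileRevision("a.c", "1.1", "bob", 100, "CR-1"),
    FileRevision("b.c", "1.1", "bob", 105, "CR-1"),
    FileRevision("c.c", "1.2", "amy", 102),
    FileRevision("d.c", "1.1", "amy", 50),
]
commits = group_into_commits(revisions)
for commit in commits:
    print([f"{r.path}@{r.number}" for r in commit])
```

With the sample data, the two `CR-1` revisions end up in one commit while `c.c` and `d.c` each get a commit of their own.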
The Good Stuff
While each of Polarion SVN Importer’s individual VCS providers has multiple features and configuration options, globally the tool has a pretty limited set of options, though I must say this is not necessarily a bad thing. At any rate, the one feature that truly stands out as both appealing and actually usable is the capability to do incremental conversions. This allows a conversion to run once to convert all current history to an SVN dump file, and to then pick up the process at a later time to convert only what has changed in the meantime. This gives you some flexibility if you need to allow developers to continue working while the conversion process runs. You can then test against that converted dump for a while to be sure it passes muster (tags are accurate, etc.). You can update that conversion incrementally to make your production transition at a moment’s notice. This way even the biggest source repository can be converted to a single SVN repository with all of the history, with a minimum of disruption to development activities.
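The general idea behind an incremental run can be sketched in a few lines of Python. This is only an illustration of the concept, not how SVN Importer tracks its state internally; the `select_new_revisions` function and the "watermark" timestamp are assumptions of mine.

```python
from typing import Iterable, List, Tuple

Revision = Tuple[str, int]  # (label, timestamp in seconds since the epoch)


def select_new_revisions(
    revisions: Iterable[Revision], last_converted: int
) -> Tuple[List[Revision], int]:
    """Return the revisions newer than the watermark, plus the new watermark."""
    pending = sorted(
        (r for r in revisions if r[1] > last_converted), key=lambda r: r[1]
    )
    new_watermark = pending[-1][1] if pending else last_converted
    return pending, new_watermark


history = [("a.c@1.1", 100), ("a.c@1.2", 200), ("b.c@1.1", 300)]

# First run: everything is new, so the whole history is converted.
first_batch, watermark = select_new_revisions(history, 0)

# Developers keep working; a later run only picks up what changed since.
history.append(("a.c@1.3", 400))
second_batch, watermark = select_new_revisions(history, watermark)
print(second_batch)
```

Each run would write its batch to a dump file that is loaded on top of the previously converted repository.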
As a general feature, the other standout with this tool is the overall quantity and quality of the metadata that is brought over. In most cases, the tool is explicitly programmed to read the various source VCS metadata such as commit date and time, commit user, tags, branches, Change Request (CR) numbers, etc. Generally this is all the data you could ask for from the source VCS. In some cases, I have had to extend the tool to bring over some additional data, and since it is Apache-licensed software and fairly well designed, it is easy enough to make these adjustments if you know Java and the tools that the source VCS provides for gathering metadata. Usually the metadata is read via the source VCS’s command line interface (CLI), but some VCSs also provide API support.
But I digress … the real beauty of this tool is the quality of how the source VCS metadata is translated into SVN. Because the software knows the syntax for writing an SVN dump file, the converted SVN repositories are truly remarkable. All dates and times of commits are accurate, associated with the correct users. In addition, other metadata are stored as part of the SVN commit comments or SVN properties. For all intents and purposes, the converted repository has an accurate and complete history of the source repository, only in SVN format.
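For readers who have never looked inside a dump file, here is a rough Python sketch of how a single revision record carries the author, date, and log message as revision properties. It is a simplified illustration and not code from SVN Importer: the function names are mine, it only produces the revision record itself, and a real dump file also needs the format-version header and node records for the changed files.

```python
from datetime import datetime, timezone
from typing import Dict


def encode_props(props: Dict[str, str]) -> bytes:
    """Encode properties as length-prefixed key/value pairs ending in PROPS-END."""
    out = b""
    for key, value in props.items():
        k = key.encode("utf-8")
        v = value.encode("utf-8")
        # Lengths are byte counts, not character counts.
        out += b"K %d\n%s\nV %d\n%s\n" % (len(k), k, len(v), v)
    return out + b"PROPS-END\n"


def revision_record(number: int, author: str, when: datetime, log: str) -> bytes:
    """Build the revision record that precedes a revision's node records."""
    props = encode_props({
        "svn:author": author,
        # SVN stores dates in UTC with microsecond precision.
        "svn:date": when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
        "svn:log": log,
    })
    header = (
        b"Revision-number: %d\n"
        b"Prop-content-length: %d\n"
        b"Content-length: %d\n\n"
    ) % (number, len(props), len(props))
    return header + props + b"\n"


record = revision_record(
    1, "alice", datetime(2010, 3, 5, 14, 30, tzinfo=timezone.utc), "Initial import"
)
print(record.decode("utf-8"))
```

Because the original commit date and user are written directly into `svn:date` and `svn:author`, the loaded repository shows the history as it happened in the source VCS rather than as of the day of the conversion.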
No Silver Bullet
The downsides to this tool all have to do with its performance. My biggest gripe is that some of the source provider models are inefficient and bloated (though in one case I encountered, I consider this most directly the fault of the legacy VCS for having weak command line tools and a bloated API). Regardless of which source provider you use, since the process of transforming the source model to the SVN model happens in memory, if you have a repository with hundreds of thousands (or millions) of files and/or file revisions, the memory usage of the process as it translates can balloon rather quickly, especially when there are also many tags and branches. In one extreme case I have seen the memory requirement climb past 24GB, though this was for a project with close to 10 million file revisions, hundreds of branches, and thousands of tags.
Besides the memory footprint, the processing time for very large projects can also become prohibitive. As previously mentioned, that’s one situation for certain where the incremental import feature can help immensely. Nevertheless, at this point it must be said, you should probably stay away from this tool if you want to convert repositories with millions of source revisions and cannot dedicate the necessary hardware (16GB +) and time (~2-3 days processing per million revisions) to the problem.
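As a quick back-of-the-envelope aid, the rule of thumb above can be turned into a tiny Python calculation. The function name is mine, and the linear scaling is an assumption; actual run time will depend on the source VCS, the hardware, and the number of tags and branches.

```python
from typing import Tuple


def estimate_processing_days(
    file_revisions: int, days_per_million: Tuple[float, float] = (2.0, 3.0)
) -> Tuple[float, float]:
    """Scale the rough 2-3 days per million revisions figure linearly."""
    millions = file_revisions / 1_000_000
    low, high = days_per_million
    return millions * low, millions * high


low, high = estimate_processing_days(10_000_000)
print(f"Roughly {low:.0f} to {high:.0f} days of processing")
```

For a project on the scale of the 10 million revision example mentioned earlier, that works out to somewhere between 20 and 30 days, which is exactly why the incremental feature matters.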
Then again, given the relatively small number of tools out there for this job, your options for these conversions are rather limited. I have written simple VCS conversion scripts from scratch before. If your needs are simple, the DIY approach is certainly doable for these legacy tools. On the other hand, if you have to write a from-scratch full history conversion program, migrating from a file revision VCS to SVN, supporting robust metadata migrations, tags and branches … you’re gonna have a bad time.
Of the other free tools out there, CVS users may of course be better off with cvs2svn. While I’ve heard that tool is not without its wrinkles, I’ve had good success with its cvs2git sub-module on smaller CVS repositories. For these old-style enterprisey VCSs, the only other tool out there for SVN conversions that I’m aware of is the cc2svn tool. As I understand it, this tool is only capable of converting the history of a single ClearCase view at a time, but given its lightweight Python implementation, it may be a nice alternative for ClearCase users with very large repositories that cannot use Polarion SVN Importer.
As a final note, I want to make clear that I believe the term “full history” conversion can be quite misleading. Whenever you perform a data migration on the scale of migrating VCSs, while all source code data should be preserved, you cannot avoid some changes in, at the very least, the form of the metadata. If your organization has strict requirements that your VCS data and metadata must be retained for a certain number of years for audit purposes, a migration of your data using a tool like the Polarion SVN Importer may or may not be enough to satisfy that requirement.
Whew … OK, that’s all for now. Stay tuned for further posts and please let me know what you think in the comments.