Export relevant TU’s from legacy TMX files in OmegaT

Situation

You have a new project and legacy translation memory files that would normally go into the /tm folder of the OmegaT project. You need to hand this project over to someone else, but you don’t want to give away all of your previous translations. So you need to extract from your TMX files only those TU’s (translation units) that have matches in the current project.

Problem

The problem is evident: you need OmegaT to get matches for each segment and, if they are good enough, store them somewhere handy, in a separate TMX file.
The problem has been (or still is being) discussed on the OmegaT Yahoo Group.

Solution


Here’s the script:

  • write_TMX_with_usefulTU.groovy
    /* Development of this script has been sponsored by Qabiria - http://www.qabiria.com
     *
     * Purpose:	Export only those TU's that are relevant for the current project
     *	from TMX files in /tm into a new TMX file
     * #Files:	Writes 'exported_relevant.tmx'
     *	in subfolder 'tmx_export' in current project's root
     * #File format:	TMX v.1.4
     * #Details:	http://wp.me/p3fHEs-7x
     *
     * @author	Kos Ivantsov
     * @date	2013-09-10
     * @version	0.1
     */
    
    /*
     * Set "select_files" to 'yes' if you want to use file selector
     * to specify files for export. If anything else is specified, the script
     * will work with the complete project.
     */
    select_files = 'no'
    
    /*
     * Specify similarity threshold for found matches. Only the ones
     * above it will make it into the exported TMX file
     */
    int similarity = 75
    /*
     * Specify wait time (in milliseconds) for each segment. It's the time
     * the script will wait for the match pane to update. You may experiment
     * with it, keeping in mind that if it's too low, you may end up having
     * wrong TU's (i.e. from previous segments) exported.
     */
    int sleeptime = 500
    
    import javax.swing.JFileChooser
    import org.omegat.core.Core
    import org.omegat.util.StaticUtils
    import org.omegat.util.TMXReader
    import static javax.swing.JOptionPane.*
    import static org.omegat.util.Platform.*
    
    def prop = project.projectProperties
    if (!prop) {
    	final def title = 'Export relevant TU\'s'
    	final def msg   = 'Please try again after you open a project.'
    	showMessageDialog null, msg, title, INFORMATION_MESSAGE
    	return
    }
    
    if (prop.isSentenceSegmentingEnabled())
    	segmenting = TMXReader.SEG_SENTENCE
    	else
    	segmenting = TMXReader.SEG_PARAGRAPH
    
    def sourceLocale = prop.getSourceLanguage().toString()
    def targetLocale = prop.getTargetLanguage().toString()
    def folder = prop.projectRoot+'/tmx_export'
    def fileloc = folder+'/exported_relevant.tmx'
    relevant_mem = new File(fileloc)
    sourceroot = prop.getSourceRoot().toString() as String
    
    // create the export folder if it doesn't exist
    if (! (new File (folder)).exists()) {
    	(new File(folder)).mkdir()
    	}
    
    relevant_mem.write("",'UTF-8')
    relevant_mem.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n", 'UTF-8')
    relevant_mem.append("<!DOCTYPE tmx SYSTEM \"tmx14.dtd\">\n", 'UTF-8')
    relevant_mem.append("<tmx version=\"1.4\">\n", 'UTF-8')
    relevant_mem.append(" <header\n", 'UTF-8')
    relevant_mem.append("  creationtool=\"OmegaTScripting\"\n", 'UTF-8')
    relevant_mem.append("  segtype=\"" + segmenting + "\"\n", 'UTF-8')
    relevant_mem.append("  o-tmf=\"OmegaT TMX\"\n", 'UTF-8')
    relevant_mem.append("  adminlang=\"EN-US\"\n", 'UTF-8')
    relevant_mem.append("  srclang=\"" + sourceLocale + "\"\n", 'UTF-8')
    relevant_mem.append("  datatype=\"plaintext\"\n", 'UTF-8')
    relevant_mem.append(" >\n", 'UTF-8')
    relevant_mem.append(" </header>\n", 'UTF-8')
    relevant_mem.append("  <body>\n", 'UTF-8')
    
    if ((select_files == 'yes')) {
    	srcroot = new File(prop.getSourceRoot())
    
    	JFileChooser fc = new JFileChooser(
    	currentDirectory: srcroot,
    	dialogTitle: "Choose files to export",
    	fileSelectionMode: JFileChooser.FILES_ONLY,
    	//the file filter must show also directories, in order to be able to look into them
    	multiSelectionEnabled: true)
    
    	if(fc.showOpenDialog() != JFileChooser.APPROVE_OPTION) {
    	console.println "Canceled"
    	return
    	}
    
    	if (!(fc.selectedFiles =~ sourceroot.replaceAll(/\\+/, '\\\\\\\\'))) {
    		console.println "Selection outside of ${prop.getSourceRoot()} folder"
    		final def title = 'Wrong file(s) selected'
    		final def msg   = "Files must be in ${prop.getSourceRoot()} folder."
    		console.println msg
    		showMessageDialog null, msg, title, INFORMATION_MESSAGE
    		return
    	}
    	files = fc.selectedFiles
    }else{
    	files = project.projectFiles.filePath}
    
    active_segment = editor.currentEntry.entryNum()
    count = 0
    hitcount = 0
    
    def match_find_write = Thread.start {
    		files.each{
    		fl = "${it.toString()}" - "$sourceroot"
    		proj_files = project.projectFiles
    		proj_files.each{
    			if ( "${it.filePath}" != "$fl" ) {
    			/*ignore*/
    			//console.println "file \"$fl\" is not supported by OmegaT"
    			}else{
    			it.entries.each {
    			count++
    			editor.gotoEntry(it.entryNum())
    			info = project.getTranslationInfo(it)
    			if (info.isTranslated()) {
    				hitcount++
    				changeId = info.changer
    				changeDate = info.changeDate
    				creationId = info.creator
    				creationDate = info.creationDate
    				alt = 'unknown'
    				source = StaticUtils.makeValidXML(it.srcText)
    				target = StaticUtils.makeValidXML(info.translation)
    				relevant_mem.append("    <tu>\n", 'UTF-8')
    				relevant_mem.append("      <tuv xml:lang=\"" + sourceLocale + "\">\n", 'UTF-8')
    				relevant_mem.append("        <seg>" + "$source" + "</seg>\n", 'UTF-8')
    				relevant_mem.append("      </tuv>\n", 'UTF-8')
    				relevant_mem.append("      <tuv xml:lang=\"" + targetLocale + "\"", 'UTF-8')
    				relevant_mem.append(" changeid=\"${changeId ?: alt }\"", 'UTF-8')
    				// format in UTC, since the pattern appends a literal 'Z'
    				relevant_mem.append(" changedate=\"${ changeDate > 0 ? new Date(changeDate).format("yyyyMMdd'T'HHmmss'Z'", TimeZone.getTimeZone('UTC')) : alt }\"", 'UTF-8')
    				relevant_mem.append(" creationid=\"${creationId ?: alt }\"", 'UTF-8')
    				relevant_mem.append(" creationdate=\"${ creationDate > 0 ? new Date(creationDate).format("yyyyMMdd'T'HHmmss'Z'", TimeZone.getTimeZone('UTC')) : alt }\"", 'UTF-8')
    				relevant_mem.append(">\n", 'UTF-8')
    				relevant_mem.append("        <seg>" + "$target" + "</seg>\n", 'UTF-8')
    				relevant_mem.append("      </tuv>\n", 'UTF-8')
    				relevant_mem.append("    </tu>\n", 'UTF-8')
    				console.println "-------\nFound translation for segment ${it.entryNum()}. Exporting"
    				}else{
    			sleep sleeptime
    			near = Core.getMatcher().getActiveMatch()
    			if (near != null) {
    				if (near.scores[0].score > similarity) {
    					hitcount++
    					changeId = near.changer
    					changeDate = near.changedDate
    					creationId = near.creator
    					creationDate = near.creationDate
    					alt = 'unknown'
    					source = StaticUtils.makeValidXML(near.source)
    					target = StaticUtils.makeValidXML(near.translation)
    					relevant_mem.append("    <tu>\n", 'UTF-8')
    					relevant_mem.append("      <tuv xml:lang=\"" + sourceLocale + "\">\n", 'UTF-8')
    					relevant_mem.append("        <seg>" + "$source" + "</seg>\n", 'UTF-8')
    					relevant_mem.append("      </tuv>\n", 'UTF-8')
    					relevant_mem.append("      <tuv xml:lang=\"" + targetLocale + "\"", 'UTF-8')
    					relevant_mem.append(" changeid=\"${changeId ?: alt }\"", 'UTF-8')
    					// format in UTC, since the pattern appends a literal 'Z'
    					relevant_mem.append(" changedate=\"${ changeDate > 0 ? new Date(changeDate).format("yyyyMMdd'T'HHmmss'Z'", TimeZone.getTimeZone('UTC')) : alt }\"", 'UTF-8')
    					relevant_mem.append(" creationid=\"${creationId ?: alt }\"", 'UTF-8')
    					relevant_mem.append(" creationdate=\"${ creationDate > 0 ? new Date(creationDate).format("yyyyMMdd'T'HHmmss'Z'", TimeZone.getTimeZone('UTC')) : alt }\"", 'UTF-8')
    					relevant_mem.append(">\n", 'UTF-8')
    					relevant_mem.append("        <seg>" + "$target" + "</seg>\n", 'UTF-8')
    					relevant_mem.append("      </tuv>\n", 'UTF-8')
    					relevant_mem.append("    </tu>\n", 'UTF-8')
    					console.println "-------\nFound good match for segment ${it.entryNum()}"
    					console.println "Segment source text is: \n${editor.currentEntry.getSrcText()}"
    					console.println "\nMatch source is: \n$near.source"
    					console.println "Match translation is: \n$near.translation\n"
    					}else{
    						console.println "-------\nNo good match found for segment ${it.entryNum()}"
    						}
    			}else{
    				console.println "-------\nNo match found for segment ${it.entryNum()}"
    					}
    				}
    			}
    			}
    		}
    	}
    		editor.gotoEntry(active_segment)
    		relevant_mem.append("  </body>\n", 'UTF-8')
    		relevant_mem.append("</tmx>", 'UTF-8')
    
    		if (hitcount == 0){
    		relevant_mem.delete()
    		final def msg   = """\
    The script has processed $count segments.
    0 TU were exported.
    Empty file $relevant_mem has been deleted.\
    """
    		final def title = 'Export result'
    		console.println msg
    		showMessageDialog null, msg, title, INFORMATION_MESSAGE
    		}else{
    		final def msg = """\
    The script has processed $count segments.
    $hitcount TU were exported to $relevant_mem.\
    """
    		final def title = 'Export result'
    		console.println msg
    		showMessageDialog null, msg, title, INFORMATION_MESSAGE
    		}
    }
    
    return
    

    This script runs through your whole project or only through selected files (controlled by the "select_files" variable near the top of the script) and gets fuzzy matches for each segment.

    If a segment is already translated, its source and translation are exported to the resulting TMX. For an untranslated segment, the best match is exported if its similarity is above the threshold (the "similarity" variable, 75 by default). Segments without matches are skipped, and matches at or below the threshold are ignored.
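
The per-segment rule described above can be sketched in isolation. This is only an illustration: "exportDecision" is a hypothetical helper that does not exist in the script, and the match score is passed in as a plain number instead of being read from OmegaT's match pane.

```groovy
// Hypothetical helper, not part of the original script: decide what gets
// exported for one segment. 'bestScore' is null when OmegaT offers no match.
def exportDecision(boolean translated, Integer bestScore, int similarity = 75) {
    if (translated) return 'translation'   // an existing translation is always exported
    if (bestScore == null) return 'skip'   // no match at all
    // strictly greater, mirroring "near.scores[0].score > similarity" in the script
    return bestScore > similarity ? 'match' : 'skip'
}

println exportDecision(true, null)   // prints "translation"
println exportDecision(false, 80)    // prints "match"
println exportDecision(false, 75)    // prints "skip"
```

Note that a match scoring exactly 75 is not exported, because the comparison is strict.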

    Note that only the best match per segment gets exported, not every match that scores above the threshold. This might be changed in the future to be more inclusive, but someone would have to give me a hint as to how to get all the matches for a given segment in a script. Maybe I can dig it out myself too, but don’t hold your breath just yet.

    Another vital value to keep in mind is the time the script pauses at each segment to wait for its matches to appear. Currently it’s set to 500 milliseconds (the "sleeptime" variable). It means that if the project has 1000 untranslated segments, the script will need at least 500 seconds (8 minutes 20 seconds) to process it. If there are a lot of TU’s in the legacy TMX file(s), or the computer where OmegaT runs is somewhat slow, this value has to be increased (which means longer processing time). While the script is running, you cannot use OmegaT, so if you need to prepare a TMX for a rather big project, be ready to run the script and let OmegaT do the work for some time. Translated segments get processed much faster, though, because the script does not pause for them. It might be worth timing how fast matches appear as you go from segment to segment by hand, and setting this value accordingly.
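
The arithmetic above is easy to redo for other project sizes or pause values. A minimal sketch, assuming only the pause itself is counted (so the result is a lower bound); "estimateRunTime" is a made-up name, not something the script provides.

```groovy
// Hypothetical helper: lower bound on the run time, counting only the
// pause spent on untranslated segments.
def estimateRunTime(int untranslatedSegments, int sleeptimeMs = 500) {
    long totalMs = (long) untranslatedSegments * sleeptimeMs
    long totalSeconds = totalMs.intdiv(1000)
    // split the total into whole minutes and leftover seconds
    [seconds: totalSeconds, minutes: totalSeconds.intdiv(60), remainder: totalSeconds % 60]
}

def t = estimateRunTime(1000)
println "${t.minutes} min ${t.remainder} s"   // prints "8 min 20 s"
```

Doubling the pause to 1000 milliseconds doubles the estimate, so it pays to find the lowest value at which the match pane reliably keeps up.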

    The script writes a TMX file that can contain duplicate entries if it runs through non-unique segments. To fix that, one can use TMXCleaner or prepare a temporary source file that contains only unique segments: updated script here.
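
Another way to avoid duplicates would be to filter them out before anything is written. The script does not do this (it appends each TU to the file straight away), so the following is only a sketch under the assumption that source/target pairs are collected in a list first; the sample data is invented.

```groovy
// Hypothetical data: [source, target] pairs collected in memory instead of
// being appended to the TMX file immediately.
def collected = [
    ['Hello', 'Bonjour'],
    ['Goodbye', 'Au revoir'],
    ['Hello', 'Bonjour'],    // the same pair met again in a repeated segment
]

// A LinkedHashSet remembers what has been seen; add() returns false for a
// repeat, so findAll keeps only the first occurrence, in original order.
def seen = new LinkedHashSet()
def uniqueTus = collected.findAll { seen.add(it) }

println uniqueTus.size()   // prints 2
```

Only pairs that are identical in both source and target are treated as duplicates here; the same source with two different translations would be kept twice.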

    You may want to disable automatic insertion of good matches (Options → Editing Behaviour → Insert the best fuzzy match), because once the script is invoked you won’t have much control over what is going on, till it finishes.

I have to admit, this was somewhat over my head, and I jumped into it without fully understanding what was involved. I want to thank Marco Cevoli at Qabiria for sponsoring the development of this script. A big thank you is also due to Yu Tang for sharing invaluable programming ideas in layman’s language. I realize there’s plenty of room for improvement, and I’m eager to get any ideas, hints, comments and criticism.

But as of now


Good luck

UPDATE:

This little script was developed much further by cienislaw and Yu Tang, and here’s a link to a much cleaner final solution that incorporates all the developments:
grab_matches.groovy
Please use this one and not the one published on this page. A big thank you goes to Marco for helping this thing get started, and to cienislaw and Yu Tang for taking my humble effort and bringing it much further than I could ever have myself.

7 thoughts on “Export relevant TU’s from legacy TMX files in OmegaT”

  2. Hello Kos

    I saw this question on the OmegaT forum and it did surprise me.
    In my usual CAT tool, I would just have loaded the document to translate and the big TM as read-only (BTM), created a new TM for the project, asked the CAT tool to leverage from the BTM above a certain fuzzy match percentage, and then sat back.
    Then I would commit all segments to the TM (they go to the new TM and not to the read-only BTM) and it is done.

    For me, all this is not so evident in OmegaT as the “translate all segments functionality” is not so developed and so it needed a script.

    Regards

    SafeTex

  3. Hi Kos

    Yeah, it is but I still have my reservations about scripts which take a lot of time to manage (copy and put in folder on both of my computers and sometimes additional folders have to be created to run the scripts)

    I saw recently that someone had taken a script and incorporated it fully into the OmegaT program (the ‘folders’ scripts are now in the OmegaT menu to give us access to any folder/file in the project)

    Personally, that is what I’d like to see happen with more scripts

    But I do appreciate the fact that people write and share scripts so thanks

    SafeTex

    • Well, this script creates the necessary folder by itself. And that Folders script (or plugin, rather) is a similar development that has been done outside of the main OmegaT project and isn’t affiliated with OmegaT. Yu Tang, the guy who developed that plugin, might be a part of the development team, but it doesn’t mean that his development is automatically included in OmegaT. You must have installed it manually. But I totally see this as a beautiful feature of OmegaT: the opportunity for anyone to develop useful stuff.

      If you have several computers and you want to have your scripts or other settings in sync, why not use Dropbox or other cloud service with sync functionality for that? Or, if it’s in the same home/small office network, you can put them somewhere to a shared folder.
