Getting rid of "rogue codes" in Word documents

For certain Word documents, memoQ will show superfluous tags that apparently serve no purpose. Once you import your document, there's no way to delete these tags; the best you can do is to insert them by pressing F8 or put them at the end of the target segment by hitting Alt+F8.

There are several possible ways to reduce such tags prior to importing:

  • Select the whole document and set the correct language
  • Accept all changes
  • Turn off track changes
  • Turn off smart tags
  • Select the whole document and set character spacing to Scale: 100%, Spacing: Normal, and Position: Normal
  • Save in OpenOffice format and then convert back to Word (this may result in some loss of formatting)
  • Save the document as Word 6.0/95 (this may result in some loss of formatting)
  • Segment the document in TRADOS and then clean it up
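
The character-spacing step in the list above can also be pictured at file level. The following is a minimal Python sketch, not a tested tool: it assumes the document is a .docx (a ZIP archive whose main text is in word/document.xml) and that Scale, Spacing and Position are stored as the run properties w:w, w:spacing and w:position. The function name reset_character_spacing is made up for this illustration. Work on a copy of the file.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Run-level properties assumed to correspond to Spacing, Scale and Position.
CHAR_TWEAKS = {W + "spacing", W + "w", W + "position"}


def reset_character_spacing(xml_text):
    """Return the XML with run-level spacing, scale and position removed.

    Hypothetical helper for illustration only.
    """
    # Keep the usual "w:" prefix when the tree is written back out.
    ET.register_namespace("w", W_NS)
    root = ET.fromstring(xml_text)
    # Only look inside w:rPr, so paragraph spacing (w:pPr/w:spacing) is kept.
    for rpr in root.iter(W + "rPr"):
        for child in list(rpr):
            if child.tag in CHAR_TWEAKS:
                rpr.remove(child)
    return ET.tostring(root, encoding="unicode")
```

To try it on a real file, the XML text could be read with zipfile.ZipFile("copy.docx").read("word/document.xml").decode("utf-8"), where copy.docx is a placeholder name. Setting the values in Word itself remains the safer route.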

These and other methods are discussed here: http://www.necco.ca/dv/word.htm#Rogue_codes

Some of these pre-import tips and others have been automated in a set of macros that I've assembled in a Word template with a custom toolbar (CodeZapper_2.3). This template also includes a few other pre- and post-processing macros that may be useful. It was principally intended for use with Deja Vu but memoQ users may also find it useful in some circumstances. You can find it in the files section (http://tech.groups.yahoo.com/group/memoQ/files/). Dave Turner

More ideas on removing rogue codes from Jim Wardell:

If Word files cause rogue codes in memoQ, pre-edit them and look for the problems Dave describes above. Also:

Make sure automatic hyphenation is deactivated in the entire Word file.

Use Find and Replace to remove all optional hyphens (in Word, search for ^- and replace with nothing).
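
For anyone who prefers to check a file outside Word, here is a rough Python sketch of the same removal. It assumes a .docx file, where an optional hyphen is stored either as the element w:softHyphen or as the character U+00AD in word/document.xml. The function name strip_optional_hyphens is invented for this example.

```python
import re

# <w:softHyphen/> is the WordprocessingML element for an optional hyphen;
# U+00AD is the equivalent plain-text character.
SOFT_HYPHEN_TAG = re.compile(r"<w:softHyphen\s*/>")


def strip_optional_hyphens(xml_text):
    """Remove optional hyphens from document XML (illustrative helper)."""
    without_tags = SOFT_HYPHEN_TAG.sub("", xml_text)
    return without_tags.replace("\u00ad", "")
```

The function works on the XML text only. Reading it from the archive and writing it back are left out, and Find and Replace in Word does the same job with less risk.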

Make sure the Word setting "Hyphenate words in CAPS" is off.

Sometimes rogue codes are caused when the font size in the source PDF text hovers between two integer sizes. The OCR output (which even allows half-point sizes) can then contain embedded font size changes (e.g. 11 pt -> 11.5 pt -> 11 pt -> 10.5 pt). If you get a lot of these, and the Word file itself contains several font sizes that you want to retain, you may need to go through the file, select each continuous passage that should have one font size, and apply the size you want (in the above case, for example, perhaps Arial 11 pt). A slicker, more professional way to do this is to apply a style definition to such passages. Your style definition would also include no automatic hyphenation and normal character spacing.
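
Why unifying the font size removes tags can be shown with a small Python sketch. It assumes, purely for illustration, that a passage is a list of (text, size in points) runs and that every boundary between runs with different formatting becomes a tag. The function name snap_font_sizes and the 0.5 pt tolerance are hypothetical.

```python
from collections import Counter


def snap_font_sizes(runs, tolerance=0.5):
    """Snap near-dominant font sizes to the dominant size and merge runs.

    runs: list of (text, size_in_points) tuples. Illustrative model only.
    """
    if not runs:
        return []
    # The dominant size is the one that covers the most characters.
    weight = Counter()
    for text, size in runs:
        weight[size] += len(text)
    dominant = weight.most_common(1)[0][0]

    merged = []
    for text, size in runs:
        # Sizes that merely hover around the dominant size are unified.
        if abs(size - dominant) <= tolerance:
            size = dominant
        # Adjacent runs with identical size no longer need a boundary.
        if merged and merged[-1][1] == size:
            merged[-1] = (merged[-1][0] + text, size)
        else:
            merged.append((text, size))
    return merged
```

In this model, five runs such as 11, 11.5, 11, 10.5 and 16 pt collapse to two: one 11 pt passage and the 16 pt heading, which is deliberately left alone because it lies outside the tolerance.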

If your source document does not contain bold or italics, select the entire file and change all to Bold + Italics. Then select the entire file again and remove the bold and italics.

If you are scanning PDF files using OCR software (and therefore are the one who created the rogue codes in the first place!), take a close look at the detailed settings that were in effect when you exported from OCR to Word. Use only the settings that you need; the others may be generating rogue codes.

The Arial Unicode font is installed by OmniPage. OmniPage inserts tags around umlauts in Word files. I've solved this problem by removing the Arial Unicode font from my system. It's also a good idea when using OCR software to use font matching to restrict the fonts that are allowed in OCR output, e.g. to Times New Roman and Arial.

If your client offers to convert PDF files to Word for you, consider that he/she may be using a cheap PDF converter program and might not have the slightest clue what he/she is doing. In this case, you're better off getting the PDF file and doing the conversion yourself with high-quality OCR software. Be aware that PDF converters, even the best ones, are likely to cause more problems with rogue codes and formatting than professional-quality OCR software. You get what you pay for. There are also major differences between the two leading professional-grade OCR programs in their ability to reduce rogue codes and produce TM-friendly formatting. Test both carefully and see which gives the more TM-friendly results in your situation.

The suggestions offered by Dave and myself should help you eliminate nearly all rogue codes. However, if you are still getting too many, open memoQ and note where a given rogue code occurs, then open the offending file in Word at the same time. Try selecting the characters around the offending location and check for any changes in settings. If you learn something useful, please register as a Wikibooks author and add it to this page.
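
The same check can be roughed out in Python for a .docx file. This is a sketch under assumptions, not a finished tool: it takes the XML of a single paragraph (a w:p element from word/document.xml), looks only at runs that are direct children of the paragraph, and reports which run properties differ between neighbouring runs. The names run_properties, run_text and formatting_changes are invented for this example.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def run_properties(run):
    """Map each run property's local name to its attribute values."""
    props = {}
    rpr = run.find(W + "rPr")
    if rpr is not None:
        for child in rpr:
            name = child.tag.replace(W, "")
            props[name] = tuple(sorted(child.attrib.values()))
    return props


def run_text(run):
    """Concatenate the text of all w:t elements in a run."""
    return "".join(t.text or "" for t in run.iter(W + "t"))


def formatting_changes(paragraph_xml):
    """Yield (text_before, text_after, changed_properties) per run boundary."""
    para = ET.fromstring(paragraph_xml)
    runs = para.findall(W + "r")  # direct children only
    for left, right in zip(runs, runs[1:]):
        a, b = run_properties(left), run_properties(right)
        changed = sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
        if changed:
            yield (run_text(left), run_text(right), changed)
```

A boundary reported with rFonts around an umlaut, or with sz between two otherwise identical passages, points to the font and font-size problems described above.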