Dr. Mark Humphrys

School of Computing. Dublin City University.


Binary v. Text



Machine-readable data - from binary to text

Traditionally, program data would be in very efficient binary format:
 (2-byte-number)(4-byte-number)(1-byte-character)(2-byte-number)....
The program needs to know the structure of the file to display it. Otherwise it doesn't know where to put the boundaries, and might display:
 (4-byte-no)(1-byte-char)(2-byte-no)(4-byte-no)....
There has been a trend towards program data that humans can read in a text editor:
 (1-byte-character)(1-byte-character)(1-byte-character)....
which displays characters that express the contents. Examples: HTML, XML, JSON, and (sort of) PostScript and TeX.
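
A minimal sketch of the boundary problem, assuming Python's struct module. The record values and the two layout strings are made up for illustration. The same 9 bytes are decoded once with the right layout and once with the boundaries in the wrong places:

```python
import struct

# Hypothetical record matching the layout in the text:
# (2-byte number)(4-byte number)(1-byte character)(2-byte number)
RIGHT_LAYOUT = "<HIcH"  # little-endian, no padding: 2 + 4 + 1 + 2 = 9 bytes
WRONG_LAYOUT = "<IcHH"  # the same 9 bytes, but split as 4 + 1 + 2 + 2

record = (513, 100000, b"A", 7)
raw = struct.pack(RIGHT_LAYOUT, *record)

# Decoding only gives back the original values if the layout is known.
decoded_right = struct.unpack(RIGHT_LAYOUT, raw)
decoded_wrong = struct.unpack(WRONG_LAYOUT, raw)

print(len(raw), "bytes on disk")
print("correct layout:", decoded_right)
print("wrong layout:  ", decoded_wrong)
```

Nothing in the bytes themselves says which layout is correct. Both calls succeed, and only one returns the data that was written.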



Text format is less efficient, but easier to work with

To store a 2-byte short integer in a file:
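
A minimal sketch of the comparison, assuming Python and an example value of 12345 (both are assumptions for illustration, not taken from the original figure). It builds the bytes that would go in the file for the binary format and for two text formats:

```python
import struct

value = 12345  # hypothetical example value that fits in a signed 2-byte short

# Binary: 2 raw bytes (little-endian byte order chosen for this sketch).
as_binary = struct.pack("<h", value)

# Text: one 1-byte character per decimal digit.
as_decimal_text = str(value).encode("ascii")

# Text: one 1-byte character per bit, written as '0' and '1'.
as_bit_text = format(value, "016b").encode("ascii")

print(len(as_binary), "bytes in binary format")
print(len(as_decimal_text), "bytes as decimal text:", as_decimal_text.decode("ascii"))
print(len(as_bit_text), "bytes as bit text:", as_bit_text.decode("ascii"))

# Reading back: binary needs the layout, text only needs int().
assert struct.unpack("<h", as_binary)[0] == value
assert int(as_decimal_text.decode("ascii")) == value
assert int(as_bit_text.decode("ascii"), 2) == value
```

For this value the sizes are 2 bytes (binary), 5 bytes (decimal text) and 16 bytes (bit text).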


To edit the data:


"Human readable" machine data

With the readable format, an expert human can debug and tweak the data in a simple text editor, if they know what they are doing.

It is a much less efficient format - it might take 16 1-byte characters to display the 2-byte number (for example, if it is written out as binary digits) - and this is why binary was so popular in the past. But we can use such schemes now because machines are more powerful, disk space is bigger and bandwidth is better.

HTML showed it could be done.
XML took the idea much further.
"Like HTML, XML files are text files that people shouldn't have to read, but may when the need arises."




Secret-format binary (Microsoft Office)

Especially hard to work with are secret binary formats like the .doc format in Microsoft Word from 1983 to 2007 (and still widely used today).
With secret-format binary, you had to use Microsoft Word to modify the data.
The .doc format is still binary, but no longer secret-format.

If you use binary Microsoft Word .doc, it is hard to write scripts to manipulate your files as you can with text files. Instead you may have to point and click inside other people's menus. This is the main reason why I never used Microsoft Word.

If you use Microsoft Word .docx, the raw data is text format (XML). This is easier to script, but there are still issues.




Scripting .doc files

It is possible to write scripts to process Word (and other Office) files.

grep *doc

Q. Say you have 1,000 MS Word .doc documents on disk. How do you search for a string in them? Because the files are binary, you can't simply do:
grep string *doc

  1. The default search in Windows Explorer can search for strings in Word files, but only returns the list of files that match, not the detailed output that grep returns.
    Q. Can Windows Explorer search be called from a DOS script? (Would need to return text output list of files.)

  2. Desktop search programs will pre-index your files and search them with a GUI or Web interface.
    Some can search Word files.
    Some can be called through a programming API.
    Q. Can any of these be called from a DOS script like grep?
    1. Google Desktop (discontinued in 2011)
    2. Windows Search

  3. You can use VBScript.
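
Besides the options above, a crude heuristic can be sketched in Python. It does not parse the .doc format at all. It assumes (and this is an assumption, not something confirmed here) that the text sits in the file either as 8-bit Windows-1252 or as 16-bit little-endian Unicode, and it just looks for those byte sequences. The function name crude_doc_grep is hypothetical:

```python
from pathlib import Path


def crude_doc_grep(needle, folder, pattern="*.doc"):
    """Return names of files whose raw bytes contain needle.

    Heuristic only: the .doc structure is not parsed, so text that
    Word has stored in fragments will be missed.
    """
    # Candidate byte sequences for the search string (assumed encodings).
    candidates = [needle.encode("utf-16-le")]
    try:
        candidates.append(needle.encode("cp1252"))
    except UnicodeEncodeError:
        pass  # needle has characters outside Windows-1252

    hits = []
    for path in sorted(Path(folder).glob(pattern)):
        data = path.read_bytes()
        if any(candidate in data for candidate in candidates):
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    for name in crude_doc_grep("string", "."):
        print(name)
```

Like Windows Explorer search, this only returns the list of files that match, not the matching lines that grep would show.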


sed *doc

Q. Say you have 1,000 MS Word .doc documents on disk. How do you go through them, changing string S1 everywhere it occurs into string S2?
This is easy if you have text format, UNIX and shell. See cweb.

  1. You can use VBScript.
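
For contrast, the text-format case is short enough to sketch in a few lines. This is a Python equivalent of a sed loop, not the shell version referred to above. The function name, the *.txt pattern and the UTF-8 encoding are assumptions for illustration:

```python
from pathlib import Path


def replace_in_text_files(folder, s1, s2, pattern="*.txt", encoding="utf-8"):
    """Replace s1 with s2 in every matching text file.

    Returns the names of the files that were changed.
    """
    changed = []
    for path in sorted(Path(folder).glob(pattern)):
        text = path.read_text(encoding=encoding)
        if s1 in text:
            # Files are rewritten in place; untouched files are left alone.
            path.write_text(text.replace(s1, s2), encoding=encoding)
            changed.append(path.name)
    return changed


if __name__ == "__main__":
    print(replace_in_text_files(".", "S1", "S2"))
```

No knowledge of the file structure is needed, which is the point: plain text can be processed by any general-purpose tool.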


More questions

  • How do I script .doc files on Linux? VBScript does not run on Linux.

  • Can I script access to online Office applications like Google Docs?
    Google Apps Script - scripting access to applications in the Google Apps family.



The end of Office secret format



Part of a Word file in XML format (DOCX):


The above corresponds to this part of the document:



How to see the XML:
  1. Rename file.docx to file.zip
  2. Unzip it
  3. See word/document.xml
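
The steps above can also be done without renaming, since a .docx file is a zip archive. A minimal sketch, assuming Python's zipfile module, that the main document is stored as word/document.xml inside the archive, and that it is UTF-8 encoded. The function name is hypothetical:

```python
import zipfile


def read_document_xml(docx_path):
    """Return the main document XML of a .docx file as a string."""
    with zipfile.ZipFile(docx_path) as archive:
        # Assumes the usual member name and UTF-8 encoding.
        return archive.read("word/document.xml").decode("utf-8")


if __name__ == "__main__":
    import sys

    print(read_document_xml(sys.argv[1])[:500])  # first 500 characters
```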



Scripting .docx files

Can we write a find and replace script for .docx files?
  1. Unzip .docx to get XML
  2. sed on the XML
  3. Zip the changed file back up as .docx
  4. Does this work? Or, if text changes, do other things (e.g. margins) need to change?
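
Steps 1 to 3 can be sketched as follows. This is a minimal sketch, not a tested Word workflow. It assumes Python's zipfile module, that the text lives in word/document.xml, and that the file is UTF-8. The function name docx_replace is hypothetical:

```python
import zipfile


def docx_replace(src_path, dst_path, s1, s2, member="word/document.xml"):
    """Copy src_path to dst_path, replacing s1 with s2 in one XML member.

    Returns the number of occurrences replaced.
    """
    count = 0
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                # Plain string replace on the raw XML, like sed would do.
                text = data.decode("utf-8")
                count = text.count(s1)
                data = text.replace(s1, s2).encode("utf-8")
            # Every other member is copied through unchanged.
            dst.writestr(info.filename, data)
    return count


if __name__ == "__main__":
    n = docx_replace("file.docx", "file-changed.docx", "S1", "S2")
    print(n, "replacements")
```

This does not settle the question in step 4. One known obstacle: Word can split a phrase that looks continuous on the page across several XML elements, so a plain string match on the raw XML can miss it. The replacement string must also be valid inside XML (for example, & would have to be written as &amp;amp;).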

On Internet since 1987.