Binary v. Text
Machine-readable data - from binary to text
Traditionally, program data would be in a very efficient binary format:
(2-byte-number)(4-byte-number)(1-byte-character)(2-byte-number)....
The program needs to know the structure of the file to display it.
Otherwise it doesn't know where to put the field boundaries
- it might read the same bytes as:
(4-byte-no)(1-byte-char)(2-byte-no)(4-byte-no)....
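A minimal sketch of the boundary problem, assuming a made-up 9-byte record and a hypothetical helper `be_field` (neither is from any real file format). Numbers are assumed to be unsigned and stored most significant byte first. The same bytes give different numbers depending on where the boundaries are put:

```shell
# be_field FILE OFFSET WIDTH (hypothetical helper): read WIDTH bytes starting
# at OFFSET and combine them, most significant byte first, into one number.
be_field() {
    value=0
    for byte in $(od -An -v -t u1 -j "$2" -N "$3" "$1"); do
        value=$(( value * 256 + byte ))
    done
    echo "$value"
}

rec=$(mktemp)
# Made-up record, bytes in hex: 00 07 | 00 00 01 00 | 41 | 00 09
printf '\000\007\000\000\001\000A\000\011' > "$rec"

# Correct layout (2-byte, 4-byte, 1-byte, 2-byte):
be_field "$rec" 0 2    # 7
be_field "$rec" 2 4    # 256
be_field "$rec" 6 1    # 65, the code for 'A'
be_field "$rec" 7 2    # 9

# Wrong layout (4-byte, 1-byte, 2-byte, ...): same bytes, different numbers.
be_field "$rec" 0 4    # 458752
be_field "$rec" 4 1    # 1
be_field "$rec" 5 2    # 65
```

Nothing in the file itself says which of the two readings is the right one.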
There has been a trend more recently towards program data
that humans can read in a text editor:
(1-byte-character)(1-byte-character)(1-byte-character)....
which displays
characters that express the contents.
- HTML, XML, and (sort of) PostScript, TeX
- Display whether things are binary or text:
file *
See "man file".
Text format is less efficient, but easier to work with
To store a 2-byte short integer in a file:
- Binary format: First 2 bytes of the file are:
10010000 10000000
You just have to "know" that the first 2 bytes are to be read together
as defining an integer.
If we interpret them that way
(as an unsigned integer, most significant byte first),
they define the number:
36992
- Readable text format: First 16 bytes of the file are:
00111100 01101110 01110101 01101101 00111110 00110011 00110110 00111001 00111001 00110010 00111100 00101111 01101110 01110101 01101101 00111110
which translates byte-by-byte as the decimal character codes:
60 110 117 109 62 51 54 57 57 50 60 47 110 117 109 62
that is, looking each code up in the ASCII character list,
the characters:
'<' 'n' 'u' 'm' '>' '3' '6' '9' '9' '2' '<' '/' 'n' 'u' 'm' '>'
i.e. when displayed in a text editor this file will read:
<num>36992</num>
All this to make the number human readable.
Since the number of digits may vary,
this may not always be the first 16 bytes of the file.
In general, it is the first line of the file.
What does that mean? It means read up until the newline character
(line feed on UNIX; carriage return followed by line feed on Windows).
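The two ways of storing 36992 can be tried out at the command line. This is a sketch only: the helper names `read_binary_number` and `read_text_number` and the file names are made up for illustration, and the binary reader assumes an unsigned number stored most significant byte first.

```shell
# read_binary_number FILE (hypothetical helper): join the first 2 bytes,
# most significant byte first, into one unsigned number.
read_binary_number() {
    value=0
    for byte in $(od -An -v -t u1 -N 2 "$1"); do
        value=$(( value * 256 + byte ))
    done
    echo "$value"
}

# read_text_number FILE (hypothetical helper): read the first line,
# then strip the <num> and </num> tags around the digits.
read_text_number() {
    IFS= read -r line < "$1"
    line=${line#"<num>"}
    line=${line%"</num>"}
    echo "$line"
}

dir=$(mktemp -d)
printf '\220\200' > "$dir/num.bin"             # the bytes 10010000 10000000
printf '<num>36992</num>\n' > "$dir/num.txt"   # 16 characters plus a newline

read_binary_number "$dir/num.bin"    # 36992
read_text_number "$dir/num.txt"      # 36992
wc -c "$dir/num.bin" "$dir/num.txt"  # 2 bytes against 17 bytes
```

The text reader does not care how many digits the number has. The binary reader has the width (2 bytes) built into it.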
To edit the data:
- Binary format:
Write a program to edit it.
- Text format:
Use text editor to edit it.
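A rough sketch of the difference, using hypothetical helpers `edit_text_number` and `edit_binary_number` (names and file layout are assumptions for illustration). The text file can be changed with a standard tool like sed, or by hand in an editor. For the binary file, a small program has to work out the two bytes and write them at the right offset:

```shell
# edit_text_number FILE NEW (hypothetical helper): replace the digits
# inside <num>...</num> with NEW, using a standard text tool.
edit_text_number() {
    sed "s|<num>[0-9]*</num>|<num>$2</num>|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# edit_binary_number FILE NEW (hypothetical helper, 0 <= NEW <= 65535):
# compute the two bytes, most significant first, and overwrite the
# first 2 bytes of FILE, leaving the rest of the file untouched.
edit_binary_number() {
    hi=$(printf '%03o' $(( $2 / 256 )))
    lo=$(printf '%03o' $(( $2 % 256 )))
    printf "\\$hi\\$lo" | dd of="$1" bs=1 count=2 conv=notrunc 2>/dev/null
}

dir=$(mktemp -d)
printf '<num>36992</num>\n' > "$dir/num.txt"
printf '\220\200' > "$dir/num.bin"

edit_text_number "$dir/num.txt" 37000
edit_binary_number "$dir/num.bin" 37000

cat "$dir/num.txt"             # <num>37000</num>
od -An -t u1 "$dir/num.bin"    # the two bytes are now 144 and 136
```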
"Human readable" machine data
With the readable format,
an expert human can debug and tweak the data
in a simple text editor,
if they know what they are doing.
It is a much less efficient format
- it might take 16
1-byte characters to display the 2-byte number
- and this is why binary was so popular in the past.
But we can think about adopting such schemes now
because machines are getting more powerful, disk space is bigger,
and bandwidth is better than it used to be.
HTML showed it could be done.
XML is now taking the idea one step further.
"Like HTML, XML files are text files that people shouldn't have to read, but may when the need arises."
Especially hard to work with are
secret binary formats like
Microsoft Word
(before Word 2007),
where you have to use somebody else's application to modify the data.
If you use Microsoft Word .doc,
you do not have access to the raw data of your document file.
You cannot grep it at the command line.
You cannot easily write scripts and small utilities to manipulate
your documents yourself (e.g. search utilities).
Instead you have to point and click inside other people's menus.
This is the main reason why I never used Microsoft Word.
- What's So Bad About Microsoft?
- Alternative View of the Microsoft Monopoly
- 1999 discussion on the secret file formats.
- Argument that the government should force Microsoft to open their formats
(which happened anyway in 2007).
- Some people point out that of course these are really bad formats:
"The file formats of MS Office were designed by Microsoft to be difficult to reverse
engineer and to be as closely tied as possible to the Microsoft platform. This does not translate to a good standard. If a standard is to be decided for Word
Processing it should be human readable, easily understandable, cross platform, and leave room for upgrades with bidirectional compatibility.
The Office formats have no concept of expandability and are neither forward nor backward compatible because Microsoft always intends to replace the
format with something incompatible in the next release to force users to upgrade. The Office file formats have no concept of interoperability because
Microsoft's primary concern is forcing people to use Microsoft Office on Microsoft Windows. The Office file formats are not easy to implement or
understand because part of their purpose is to delay competitors from reverse engineering them."
- Plaintext was an essential reason why HTML succeeded
- people could see how it was structured, and do it themselves.
Scripting Word (and other Office)
It is possible, but hard, to write scripts to process Office files.
Q. Say you have 1,000 MS Word .doc (pre-2007) documents on disk.
How do you search for a string in them?
You can't do:
grep string *doc
- The default search in Windows Explorer
can search for strings in Word files, but only returns the list of files that match,
not the detailed output that grep returns.
Q. Can Windows Explorer search be called from a DOS script?
(Would need to return text output list of files.)
- Desktop search programs
will pre-index your files and search them with a GUI or Web interface.
Some can search Word files.
Some can be called through a programming API.
Q. Can any of these be called from a DOS script like grep?
- Google Desktop
- Windows Search
- You can use VBScript.
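For comparison, here is what the two kinds of search output look like on plain-text files. This is a sketch with made-up file names and contents, and the helper names `list_matching_files` and `show_matching_lines` are hypothetical. `grep -l` gives only the list of matching files. `grep -n` gives the detailed output.

```shell
# list_matching_files STRING FILE... (hypothetical helper):
# only the names of the files that contain STRING.
list_matching_files() {
    s=$1
    shift
    grep -l -- "$s" "$@"
}

# show_matching_lines STRING FILE... (hypothetical helper):
# grep's detailed output, i.e. file name, line number and the matching line.
# /dev/null is added so the file name is printed even for a single file.
show_matching_lines() {
    s=$1
    shift
    grep -n -- "$s" "$@" /dev/null
}

dir=$(mktemp -d)
printf 'first line\nthe budget is late\n' > "$dir/memo1.txt"
printf 'no match here\n' > "$dir/memo2.txt"

list_matching_files budget "$dir"/*.txt
show_matching_lines budget "$dir"/*.txt
```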
Q. Say you have 1,000 MS Word documents on disk.
How do you go through them,
changing string S1 everywhere it occurs into string S2?
This is easy (a few lines of code) if you have text formats, UNIX and shell scripts
(grep and sed).
- You can use VBScript.
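For text files, the "few lines of code" could look something like the following. This is a sketch, not the only way to do it. The function name `replace_in_files` is made up, and it assumes S1 and S2 contain no characters that are special to sed (such as / . * [ \ &) and that file names contain no newlines.

```shell
# replace_in_files S1 S2 FILE... (hypothetical helper):
# change S1 into S2 everywhere it occurs in the given text files.
replace_in_files() {
    s1=$1
    s2=$2
    shift 2
    # grep -l lists only the files that contain S1 (-F: fixed string),
    # so files without a match are never rewritten.
    grep -lF -- "$s1" "$@" | while IFS= read -r f; do
        # Write to a temporary file first, then replace the original.
        sed "s/$s1/$s2/g" "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done
}
```

Example use, on all the .txt files in the current directory: `replace_in_files colour color *.txt`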
Q. For all the above,
do I have to use Windows?
What if I want to copy my 1,000 Word files to UNIX, where I normally work?
Can I manipulate them there on UNIX?
Or do I have to use Windows?
Q. Can you script access to online Office applications like Google Docs?
- Google Apps Script
- scripting access to applications in the Google Apps family.
The end of Office secret format
- Office 2007 on