Sometimes we want to find out whether a text document contains only ASCII characters or whether there are some non-ASCII characters in it. You see, ASCII is an old 7-bit encoding for text. Hence, it supports only a very limited range of characters and certainly not fancy stuff like “ä”, let alone “你好”. Most tools today understand the Unicode character set encoded with UTF-8, which is backward compatible with ASCII and can store all such characters. Some software, however, still takes offense if characters outside the ASCII range appear, e.g., in names of functions in some programming languages. Other software can deal with Unicode well, but dislikes certain characters in certain places. If we get some strange LaTeX error when compiling a paper draft, it can sometimes help to check whether we accidentally let a “。” slip in somewhere instead of a “.”.
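To make the difference between the two full stops visible, here is a minimal sketch that prints the bytes behind each of them. It assumes the snippet is saved as UTF-8 and that `od` and `tr` are available; `hex_bytes` is a hypothetical helper name used only for this illustration.

```shell
#!/bin/bash
# Sketch: show the UTF-8 bytes behind two look-alike full stops.
# Assumes this file is saved as UTF-8 and that od and tr are available.

# hex_bytes is a hypothetical helper: it prints the bytes of its argument
# as one run of lowercase hex digits.
hex_bytes() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

asciiStop="$(hex_bytes '.')"   # ASCII full stop
cjkStop="$(hex_bytes '。')"    # ideographic full stop, U+3002

echo "ASCII full stop:       $asciiStop"
echo "ideographic full stop: $cjkStop"
```

The ASCII full stop is the single byte 2e, which lies below 0x80. The ideographic full stop is encoded as the three bytes e3 80 82, all of which lie above 0x7F, so a search for anything outside the ASCII range will catch it.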

Here I provide the little script findNonASCIIchars.sh, which does this in the terminal. It takes the path to a file as argument. It then searches the file for characters outside the ASCII range and prints where it found them.

You can download this script here, and the complete collection of my personal scripts is available here.

#!/bin/bash

# This script searches for characters that are not in the normal ASCII range in a document.
# As argument, it expects the path to the file to check.

# strict error handling
set -o pipefail  # a pipeline fails if any command in it fails, not only the last one
set -o errtrace  # ERR traps are inherited by functions, command substitutions, and subshells
set -o nounset   # set -u : exit the script if you try to use an uninitialized variable
set -o errexit   # set -e : exit the script if any statement returns a non-true return value

srcDocument="$(realpath "$1")"

if [ -f "$srcDocument" ]; then
  echo "$(date +'%0Y-%0m-%0d %0R:%0S'): Now searching for non-ASCII characters in document '$srcDocument'."
  if (grep --color='auto' -P -n '[^\x00-\x7F]' "$srcDocument"); then
    echo "$(date +'%0Y-%0m-%0d %0R:%0S'): Found some non-ASCII characters in document '$srcDocument'."
  else
    echo "$(date +'%0Y-%0m-%0d %0R:%0S'): No non-ASCII characters found in document '$srcDocument'."
  fi
else
  echo "$(date +'%0Y-%0m-%0d %0R:%0S'): '$srcDocument' is not a file."
  exit 1
fi
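
To try the check without a real document, the following sketch runs the same grep pattern on a temporary file. It assumes GNU grep with PCRE support (needed for -P), just like the script above. The helper name `count_non_ascii_lines` and the sample text are made up for this illustration and are not part of the original script.

```shell
#!/bin/bash
# Sketch: apply the script's grep pattern to a throwaway file.
# Assumes GNU grep built with PCRE support (required for -P).

# count_non_ascii_lines is a hypothetical helper: it prints how many lines
# of the given file contain at least one character outside 0x00-0x7F.
count_non_ascii_lines() {
  # -c prints the number of matching lines; grep exits with status 1 when
  # that number is 0, so "|| true" keeps the helper from reporting failure.
  grep -c -P '[^\x00-\x7F]' "$1" || true
}

tmpFile="$(mktemp)"
# Three lines; only the second one contains a non-ASCII character.
printf 'plain line\nfull stop here。\nanother plain line\n' > "$tmpFile"

matches="$(count_non_ascii_lines "$tmpFile")"
echo "lines with non-ASCII characters: $matches"

rm -f "$tmpFile"
```

Counting with -c instead of printing the matching lines is a choice made only to keep the sketch easy to check. The script itself uses -n, which prints each offending line together with its line number.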