R: Dealing With File Encodings Using readr::guess_encoding

eric | May 7, 2021, 4:47 p.m.

Dealing with different file encodings across a set of data files can be a bit of a pain [1], but there is one tool that is really useful in this situation. Using the readr package [2] with its guess_encoding function works most of the time, and a few additions can make it even better.

Introduction

So what is the problem with encodings? Say you want to read a CSV file. If you don't specify the encoding when using read.csv(), you might find yourself in trouble:

df1_2016 <- read.csv(
  "~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv",
  header=TRUE, sep=",")
Error in make.names(col.names, unique = TRUE) : 
 invalid multibyte string at '<0a>'
In addition: Warning messages:
1: In read.table(file = file, header = header, sep = sep, quote = quote, :
 line 1 appears to contain embedded nulls
2: In read.table(file = file, header = header, sep = sep, quote = quote, :
 line 2 appears to contain embedded nulls
3: In read.table(file = file, header = header, sep = sep, quote = quote, :
 line 3 appears to contain embedded nulls
4: In read.table(file = file, header = header, sep = sep, quote = quote, :
 line 4 appears to contain embedded nulls
5: In read.table(file = file, header = header, sep = sep, quote = quote, :
 line 5 appears to contain embedded nulls
6: In read.table(file = file, header = header, sep = sep, quote = quote, :
 line 1 appears to contain embedded nulls

What is going on? This looks like a typical encoding problem. If we check the encoding of the file from the command line (for example with file -b --mime-encoding on Linux), we get "UTF-16LE". This needs to be passed in the fileEncoding argument of read.csv(). From the file help page:

    The encodings '"UCS-2LE"' and '"UTF-16LE"' are treated specially,
    as they are appropriate values for Windows 'Unicode' text files.
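
If you prefer to stay inside R, readr's guess_encoding, the subject of the rest of this post, will tell you the same thing. A minimal sketch, using the example file from above:

library(readr)

# Returns a tibble of candidate encodings with confidence scores;
# for a UTF-16LE file such as this one, UTF-16LE should rank at or
# near the top.
guess_encoding("~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv")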

In addition, we should specify the dec argument. We then get:

df1_2016 <- read.csv(
  "~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv",
  header=TRUE, sep=",", fileEncoding="UTF-16LE", dec=".")

This works, but so far it is a hopelessly manual process that will not work well in a reproducible data pipeline. How can we create a fully automated process for reading CSVs with different encodings?

A Fully Automated Solution

To automate the reading process, we need the encoding of each file to be read. If you know the set of encodings the files are likely to have, you can combine that knowledge with guess_encoding to get a much more reliable, and possibly fully automated, solution. It is not absolutely fail-safe, though, so proper error handling must be part of the script. One possible set of rules for the guessing process:

  1. Walk through the encodings guessed by guess_encoding ("encodings_guessed" below). If a guessed encoding is also in the set of probable encodings ("encodings_probable" below), use it to try to read the file.
  2. If no guessed encoding is probable, or the reads in step 1 didn't work out, move on to the probable encodings and try each in turn.
  3. If steps 1 and 2 do not work out, go to town and try every encoding the detector knows about.

Typically, a file read that uses the wrong encoding will take a long time, so a timeout could be used to discard such reads as unsuccessful. This is a high-risk strategy, however, as reading big files and/or reading from shared resources could lead to spurious timeouts.
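
If you want to experiment with timeouts despite that risk, a minimal sketch using R.utils::withTimeout could look like the following. It assumes the R.utils package is installed; the function name read_with_timeout and the 10-second limit are illustrative choices, not part of the pipeline below.

library(R.utils)

# Give up on a read after a fixed number of seconds. withTimeout()
# signals a TimeoutException, which we turn into NA so the caller can
# treat the read as unsuccessful. The 10-second default is arbitrary.
read_with_timeout <- function(file_to_read, file_encoding, seconds = 10) {
  tryCatch(
    withTimeout(
      read.csv(file_to_read, header=TRUE, sep=",",
               fileEncoding=file_encoding, dec="."),
      timeout = seconds
    ),
    TimeoutException = function(cond) {
      message(paste("Timed out reading", file_to_read, "as", file_encoding))
      NA
    }
  )
}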

A Function For Reading

First of all, we need a function that reads a CSV and gives us some meaningful feedback if it fails. It could look something like this:

read_encoded_csv <- function(file_to_read, file_encoding) {
  out <- tryCatch(
    {
      message("Trying encoding")
      read.csv(file_to_read, header=TRUE, sep=",", fileEncoding=file_encoding, dec=".")
    },
    error=function(cond) {
      message(paste("Encoding doesn't seem to work:", file_encoding))
      message("Here's the original error message:")
      message(cond)
      # Return value in case of error
      return(NA)
    },
    warning=function(cond) {
      message(paste("Encoding caused a warning:", file_encoding))
      message("Here's the original warning message:")
      message(cond)
      # Return value in case of warning
      return(NULL)
    },
    finally={ 
      message("\n")
      message(paste("Processed file:", file_to_read))
      message(paste("Processed encoding:", file_encoding))
    }
  )    
  return(out)
}
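
For the UTF-16LE file from the introduction, a call looks like this:

df1_2016 <- read_encoded_csv(
  "~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv",
  "UTF-16LE")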

Guessing The Encoding

If we use guess_encoding from the readr package, combined with knowledge of the encodings that the files are most likely to have, we can use something like this:

library(readr)

determine_encoding <- function(file_to_read, encodings_probable) {
  message(paste("DETERMINE ENCODING FOR NEW FILE:", file_to_read))
  encoding_used <- ""
  encodings_guessed <- guess_encoding(file_to_read)[["encoding"]]
  encodings_all <- c("UTF-8","UTF-16BE","UTF-16LE","UTF-32BE","UTF-32LE",
                    "Shift_JIS","ISO-2022-JP","ISO-2022-CN","ISO-2022-KR",
                    "GB18030","Big5","EUC-JP","EUC-KR","ISO-8859-1","ISO-8859-2",
                    "ISO-8859-5","ISO-8859-6","ISO-8859-7","ISO-8859-8",
                    "ISO-8859-9","windows-1250","windows-1251","windows-1252",
                    "windows-1253","windows-1254","windows-1255","windows-1256",
                    "KOI8-R","IBM420","IBM424")
  for (encoding_guessed in encodings_guessed) {
    # Debug output: the current candidate and the full guessed set
    print(encoding_guessed)
    print(encodings_guessed)
    if (encoding_guessed %in% encodings_probable) {
      t <- read_encoded_csv(file_to_read, encoding_guessed)
      if (reading_makes_sense(t)) {
        encoding_used <- encoding_guessed
        break
      }
    }
  }
  # Fall back: try every probable encoding
  if (encoding_used == "") {
    for (encoding_probable in encodings_probable) {
      t <- read_encoded_csv(file_to_read, encoding_probable)
      if (reading_makes_sense(t)) {
        encoding_used <- encoding_probable
        break
      }
    }
  }
  # Last resort: try every encoding the detector can report
  if (encoding_used == "") {
    for (encoding_all in encodings_all) {
      t <- read_encoded_csv(file_to_read, encoding_all)
      if (reading_makes_sense(t)) {
        encoding_used <- encoding_all
        break
      }
    }
  }
  return(encoding_used)
}

If the initial guess fails, we try all the probable encodings. If that fails, we try all possible encodings. R can list every encoding the platform knows how to convert with iconvlist(), but readr's guess_encoding is built on stringi::stri_enc_detect, which can only detect a limited set of encodings. It therefore only makes sense to check for those:

  • UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE
  • Shift_JIS, ISO-2022-JP, EUC-JP (Japanese)
  • ISO-2022-KR, EUC-KR (Korean)
  • ISO-2022-CN (Simplified Chinese), GB18030 (Chinese), Big5 (Traditional Chinese)
  • ISO-8859-1, windows-1252 (Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish)
  • ISO-8859-2, windows-1250 (Czech, Hungarian, Polish, Romanian)
  • ISO-8859-5, windows-1251, KOI8-R (Russian)
  • ISO-8859-6, windows-1256, IBM420 (Arabic)
  • ISO-8859-7, windows-1253 (Greek)
  • ISO-8859-8, windows-1255, IBM424 (Hebrew)
  • ISO-8859-9, windows-1254 (Turkish)
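
To see the raw detector output that guess_encoding builds on, you can call stri_enc_detect on the file's bytes yourself. A minimal sketch, assuming the stringi package is installed and using the example file from the introduction:

library(stringi)

# Read the raw bytes and let ICU's detector rank candidate encodings.
# The result is a list with one data frame per input, holding the
# columns Encoding, Language and Confidence.
path <- "~/Git/data_science/Natura2000/data-original/2016/BIOREGION.csv"
raw_bytes <- readBin(path, what = "raw", n = file.size(path))
stri_enc_detect(raw_bytes)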

Does It Make Sense?

We should check that the reading makes sense, even when an encoding is found and read.csv() reads the CSV file without a hitch. Using what is known about the CSVs, pertinent rules can be: (1) the result of the file read is a data frame, (2) the data frame has rows, and (3) the data frame has columns:

reading_makes_sense <- function(content_read) {
  is.data.frame(content_read) &&
    nrow(content_read) > 0 &&
    ncol(content_read) > 0
}
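
If you know more about the files, the check can be made stricter. Here is a sketch of a variant that also requires specific columns; the function name and the required_cols argument are hypothetical, so substitute the columns your CSVs must have:

# Stricter variant: also require a known set of columns to be present.
reading_makes_sense_strict <- function(content_read, required_cols) {
  is.data.frame(content_read) &&
    nrow(content_read) > 0 &&
    all(required_cols %in% colnames(content_read))
}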

Putting It All Together - Main Loop

The main loop could look something like this:

files <- list.files(path="data-original", pattern="\\.csv$",
                    full.names=TRUE, recursive=TRUE)
encodings_probable <- c("UTF-8","UTF-16LE")
lapply(files, function(x) {
  file_encoding <- determine_encoding(x, encodings_probable)
  message(paste("Determined encoding: ", file_encoding))
  if (file_encoding != "") {
    t <- read_encoded_csv(x, file_encoding)
    # apply function
    out <- dim(t)
    rm(t)
  } else {
    out <- "No encoding found!"
  }
  # write to file
  write.table(out, file = "data/test.txt", append = TRUE,
              sep="\t", quote=FALSE, row.names=TRUE, col.names=x)
})
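
If you would rather keep the successfully read data frames in memory than write their dimensions to a file, the same loop can return a named list. A sketch under the same assumptions as above (read_all is a hypothetical name):

read_all <- function(files, encodings_probable) {
  results <- lapply(files, function(x) {
    file_encoding <- determine_encoding(x, encodings_probable)
    if (file_encoding != "") read_encoded_csv(x, file_encoding) else NULL
  })
  names(results) <- files
  # Keep only the reads that produced an actual data frame
  Filter(is.data.frame, results)
}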

The main loop isn't particularly pretty, but it gets the job done. In fact, the code has been used extensively on real data sets and has only failed when dealing with corrupted files, which would trip up any file-reading function. Any ideas on refactoring are most welcome.

Conclusion

It is possible to automate the reading of files with different encodings using a combination of what you know about the files: their possible encodings, the separator used, the decimal sign used, and the desired result of the reading. Testing has shown that this works as long as the files have one of the encodings that readr's guess_encoding can detect.

References

[1] How to detect the right encoding for read.csv - https://stackoverflow.com/questions/4806823/how-to-detect-the-right-encoding-for-read-csv/35854870#35854870
[2] The readr package - https://cran.r-project.org/package=readr

Annex

In practical use, the script above has only failed to produce a result when dealing with corrupt files. A typical sign that a file is corrupt is output like the following from the script:

[1] "DETERMINE ENCODING: NEW FILE"
[1] "UTF-8"
[1] "UTF-8"        "windows-1252" "Big5"         "windows-1254"
Trying encoding
Encoding caused a warning: UTF-8
Here's the original warning message:
embedded nul(s) found in input

Processed file: data-original/2012/NATURA2000SITES.csv
Processed encoding: UTF-8
[1] "windows-1252"
[1] "UTF-8"        "windows-1252" "Big5"         "windows-1254"
[1] "Big5"
[1] "UTF-8"        "windows-1252" "Big5"         "windows-1254"
[1] "windows-1254"
[1] "UTF-8"        "windows-1252" "Big5"         "windows-1254"
Trying encoding
Encoding caused a warning: UTF-8
Here's the original warning message:
embedded nul(s) found in input

Processed file: data-original/2012/NATURA2000SITES.csv
Processed encoding: UTF-8
Trying encoding
Encoding caused a warning: UTF-16LE
Here's the original warning message:
line 1 appears to contain embedded nulls

[...]

Processed file: data-original/2012/NATURA2000SITES.csv
Processed encoding: KOI8-R
Trying encoding
Encoding caused a warning: IBM420
Here's the original warning message:
invalid input found on input connection 'data-original/2012/NATURA2000SITES.csv'

Processed file: data-original/2012/NATURA2000SITES.csv
Processed encoding: IBM420
Trying encoding
Encoding caused a warning: IBM424
Here's the original warning message:
invalid input found on input connection 'data-original/2012/NATURA2000SITES.csv'

Processed file: data-original/2012/NATURA2000SITES.csv
Processed encoding: IBM424
Determined encoding:  
