
Handling unknown encoding of file names

Sometimes we receive files from systems with different locales, operating systems, languages, and so on.
Some operating systems, such as Linux, treat file names as bytes. But to display a file name in a terminal, store it as a string in a database, or show it on a website, those bytes must be decoded into a string.

What is encoding?
For computers, everything is a sequence of 0s and 1s. How that data is interpreted depends on the metadata that accompanies it. For example, bytes can be treated as pixels in an image, as numbers, or as characters. An encoding is a scheme for representing characters as binary data: encoding converts characters to bytes, and decoding converts bytes back to characters. There are many such schemes. One of the best known and simplest is ASCII.
For example, 01000001 decodes to a capital A if we use ASCII.
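
The ASCII example can be checked in Python. This is only a small sketch; the variable names are chosen for illustration.

```python
# The byte 0b01000001 (decimal 65, hex 0x41) is the ASCII code for "A".
raw = bytes([0b01000001])

decoded = raw.decode("ascii")   # bytes -> str (decoding)
encoded = "A".encode("ascii")   # str -> bytes (encoding)

print(decoded)         # A
print(encoded == raw)  # True
```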

If we don't know which encoding was used, we cannot reliably produce the correct characters from a given binary sequence. Some bytes can be invalid in a particular encoding, and the same bytes may represent different characters in different encodings.
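
A short Python sketch of both effects, using a made-up file name (the bytes below are an assumption for illustration, not taken from a real system):

```python
# Hypothetical file name whose accented letter was written as the single byte 0xE9.
raw = b"caf\xe9.txt"

# The same byte maps to different characters in different encodings.
as_latin1 = raw.decode("latin-1")   # 0xE9 is "é" in Latin-1
as_cp1251 = raw.decode("cp1251")    # 0xE9 is "й" in Windows-1251

# In UTF-8 the byte sequence is simply invalid.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(as_latin1, as_cp1251, utf8_ok)
```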

Coming back to the problem:
When we don't know which encoding was used on the system that produced the file names, we are left with the following options for decoding them.

1. Decode the bytes with some default encoding and ignore the invalid bytes.
2. Decode the bytes with some default encoding and replace the invalid bytes with a placeholder such as the ? character.
3. Decode the bytes with a default encoding and use surrogate characters (special characters that stand in for the invalid bytes).
4. Detect the encoding of the text and decode with the detected encoding (chardet or other encoding detection packages).
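
In Python, the first three options roughly correspond to the errors argument of bytes.decode. The sketch below assumes UTF-8 as the default encoding and uses a made-up file name. Note that when decoding, Python's "replace" handler inserts U+FFFD rather than a literal ?. The guess_and_decode helper is hypothetical; it relies on the third-party chardet package only if it is installed, and a detector's guess can be wrong, especially for short names.

```python
raw = b"caf\xe9.txt"  # hypothetical file name, not valid UTF-8

# Option 1: drop the bytes that cannot be decoded.
ignored = raw.decode("utf-8", errors="ignore")

# Option 2: substitute a placeholder (U+FFFD) for each undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

# Option 3: map each undecodable byte to a lone surrogate code point.
# Unlike options 1 and 2, this is reversible.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")


# Option 4: ask a detector for a guess (assumes chardet may be installed).
def guess_and_decode(data, fallback="utf-8"):
    try:
        import chardet  # third-party, optional
    except ImportError:
        return data.decode(fallback, errors="replace")
    guess = chardet.detect(data)["encoding"] or fallback
    return data.decode(guess, errors="replace")


print(repr(ignored), repr(replaced), repr(escaped), restored == raw)
```

Option 3 is the only one of the first three that keeps the original bytes recoverable, which matters if the name must later be passed back to the file system.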

