ownCloud and those old non-UTF-8-encoded files

The problem

On a fresh ownCloud 8.2.2 installation, trying to synchronize a bunch of old files with international characters in their names using the ownCloud client software throws an error and stops the whole process. These files came from different operating systems that predate the widespread adoption of the UTF-8 charset.

Determining the different encodings

The whole directory structure is about 17 GB in size. Some of the file names in it are UTF-8-encoded, whereas others use different charsets. The main problem is that ownCloud uses UTF-8 by default, so only UTF-8-encoded names are going to be synchronized correctly. Therefore, we need to rename every file or directory whose name contains international characters in any encoding other than UTF-8.

The first thing is to identify all those files and what sort of encoding their names use. Bear in mind that our main problem here is to encode their filenames using UTF-8; we do not even care about their contents (which, by the way, remain the same after running our fix).

We can use the chardet module for Python and write a trivial Python script to achieve this:

#!/usr/bin/python
# Walk the directory tree and log the charset detected for every
# directory and file name, collecting the full list of charsets seen.
import os
import chardet

dirname  = "directory_to_fix"
total    = 0
totald   = 0
logfile  = "./logfile.txt"
line     = ""
charsets = []

logf = open(logfile, "w+")

def addCharset(charset):
	global charsets
	if charset not in charsets:
		charsets.append(charset)

for dname, dnames, fnames in os.walk(dirname):
	totald += 1
	# Charset detected for the directory name itself
	charset = chardet.detect(dname)['encoding']
	addCharset(charset)
	line = dname + ":[" + charset + "]"
	line += "\n"
	logf.write(line)
	for fname in fnames:
		total += 1
		# The per-file detection only feeds the summary; each file is
		# logged with its parent directory's charset
		addCharset(chardet.detect(fname)['encoding'])
		line = "\t" + fname + ":[" + charset + "]"
		line += "\n"
		logf.write(line)

print ">>> SUMMARY <<<"
print "Total directories : ", totald, "."
print "Total files:", total, "."
print "Total charsets detected: ", len(charsets), "."
print ">>> CHARSETS DETECTED <<<"
for charset in charsets:
	print "\t", charset

logf.close()

After running this Python script, this is what we get:

>>> SUMMARY <<<
Total directories : 1532 .
Total files: 23795 .
Total charsets detected: 5 .
>>> CHARSETS DETECTED <<<
ascii
GB2312
ISO-8859-2
utf-8
windows-1252

The logfile created during the script execution allows us to associate every single directory name and file name with its detected encoding. The chardet module also returns a confidence value that we do not use here, although it could improve the reliability of the encoding detection.
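
For instance, a minimal sketch of such a check inside the script's inner loop could look like this (the 0.8 threshold is an arbitrary value of ours, not a chardet recommendation):

# Skip names whose detected encoding is not reliable enough
result = chardet.detect(fname)
charset = result['encoding']
if charset is None or result['confidence'] < 0.8:
	logf.write("\t" + fname + ": unreliable detection, skipping\n")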

So, having a look at our logfile we can find all the filenames or directory names that are not UTF-8-encoded. For example, let’s look for any GB2312-encoded file or directory name:

14033 directory_to_fix/armari/tesina/s/Sección_y=31:[GB2312]
14034 crosevol.dat:[GB2312]
14035 morfo51.dat~:[GB2312]

So now it is a matter of trying to fix their names so that they are correctly encoded using UTF-8. Otherwise, ownCloud would not be able to synchronize them.

The convmv utility

We can toy around with the filename encoding problem using the convmv utility. This tool converts a file or directory name from its current encoding to the desired one. So we install it this way:

# apt-get install convmv

Now, imagine we have a filename encoded using, say, ISO-8859-1 and we want to encode its filename using UTF-8 instead. We can run the tool this way:

# convmv -f ISO-8859-1 -t UTF-8 --notest filename

And now, the filename has been encoded using UTF-8! Easy as pie, right?
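
Under the hood, this is just a byte-level re-encoding of the name. As a quick illustration in a Python 2 shell, the single ISO-8859-1 byte 0xF3 ('ó') becomes the two UTF-8 bytes 0xC3 0xB3:

>>> "Secci\xf3n".decode("ISO-8859-1").encode("UTF-8")
'Secci\xc3\xb3n'

This is essentially the translation convmv applies to the name before renaming the file on disk.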

Let's use convmv in our Python script

Of course, there are other ways to convert a file or directory name from its current encoding to UTF-8 using pure Python, but we are going to make use of convmv instead. Therefore, we can add a new function to our script called iconvName:

def iconvName(charset, dname, fnamep):
	global line
	line += " ==> [UTF-8]"
	# Quote the full path so spaces and special characters survive the shell
	fname = "\"" + os.path.join(dname, fnamep) + "\""
	cmd = "convmv -f " + charset + " -t UTF-8 --notest " + fname + " > /dev/null 2>&1"
	os.system(cmd)
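
Note that os.system() throws away convmv's exit status, so we never know whether a particular rename actually worked. If you want that information, a subprocess-based variant could look like this (just a sketch; the convmvToUTF8 name is ours, and it assumes convmv is available on the PATH):

import os
import subprocess

def convmvToUTF8(charset, path):
	# Run convmv silently and report whether it exited cleanly
	args = ["convmv", "-f", charset, "-t", "UTF-8", "--notest", path]
	devnull = open(os.devnull, "w")
	try:
		return subprocess.call(args, stdout=devnull, stderr=devnull) == 0
	finally:
		devnull.close()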

And now, we have to insert a few lines of code in the main loop in order to rename either the directory names or the filenames by calling this new function:

for dname, dnames, fnames in os.walk(dirname):
	totald += 1
	charset = chardet.detect(dname)['encoding']
	addCharset(charset)
	line = dname + ":[" + charset + "]"
	# Rename the directory itself if its name is neither ASCII nor UTF-8
	if 'ascii' not in charset and 'utf-8' not in charset:
		iconvName(charset, dname, "")
	line += "\n"
	logf.write(line)
	for fname in fnames:
		total += 1
		addCharset(chardet.detect(fname)['encoding'])
		line = "\t" + fname + ":[" + charset + "]"
		# Rename the file, reusing its parent directory's charset
		if 'ascii' not in charset and 'utf-8' not in charset:
			iconvName(charset, dname, fname)
		line += "\n"
		logf.write(line)

We do not rename any file or directory having either ascii or utf-8 in its charset variable. ASCII basically means that there are no international characters in the name, so it is not going to be problematic for ownCloud.

Once we have made all these changes to our script, we run it again. Now, having a look at the logfile, we can see all the files and directories that have been renamed so that their international characters conform to the UTF-8 encoding standard:

14033 directory_to_fix/armari/tesina/s/Sección_y=31:[GB2312]
14034 crosevol.dat:[GB2312] ==> [UTF-8]
14035 morfo51.dat~:[GB2312] ==> [UTF-8]

So far so good. Now we run the ownCloud client software once again, starting the synchronization process. This time, every single file is correctly synchronized with no issues at all (including the bunch of old files). They have been transferred to the client with valid UTF-8-encoded filenames this time, so they are displayed correctly (no odd characters involved). Problem fixed!

It is obvious that my script can be improved in many ways. For starters, you should check the confidence value returned by chardet.detect() in order to make sure the detected encoding is reliable enough. Another improvement would be to replace the call to convmv entirely with pure Python. It would also be worth replacing the call to os.system() so that the convmv exit value can tell us whether the conversion completed successfully or not. And last but not least, maybe using threads would increase the performance on so large a directory. So feel free to adapt it as you please!
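
As a starting point for the pure-Python route, a minimal sketch could be the following (Python 2; it assumes the detected charset is correct, and the renameToUTF8 helper name is ours):

import os

def renameToUTF8(dname, fname, charset):
	# Decode the raw name with its detected charset and re-encode it as UTF-8
	src = os.path.join(dname, fname)
	dst = os.path.join(dname, fname.decode(charset).encode("utf-8"))
	if src != dst and not os.path.exists(dst):
		os.rename(src, dst)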

You can get a generic version of the script HERE. The script can be used this way:

./utf8.py -d directory_to_explore_and_fix -r