The utf8.py script
Some months ago I had to face an annoying issue that affected the ownCloud client during the folder-synchronization process. As a result, I wrote a trivial Python script that helped me rename the non-UTF-8 filenames using the UTF-8 encoding. Today I had to deal with the very same issue, so I decided to add some functionality to the original script.
This script has been written in Python 2.7. This is what you will need in order to execute the script:
- Python 2.7.
- The convmv utility (# apt-get install convmv).
- The Python Chardet module (# apt-get install python-chardet).
- The script itself, utf8.py.
Using the script
./utf8.py -d PATH [-t THRESHOLD] [-l LOG] [-r]
-d PATH: The directory to analyse and, if the -r flag is given, to fix (i.e., all the files and directories inside the PATH directory will be renamed according to the UTF-8 encoding standard).
-t THRESHOLD: For every charset it detects, the Chardet module reports a value called “confidence”, which expresses how sure it is about the detection. The -t flag sets the minimal confidence that a detected charset must reach before the script attempts to rename the file or directory using UTF-8. This is a numerical value in the range [0..1]. Default value: 0.8.
-l LOG: By default, the script creates a logfile called utf8-log.txt in the directory where it is executed. This flag lets you choose a different location and name for the logfile.
-r: By default, the execution of the utf8.py script is a dry-run; i.e., the files and directories of PATH will not be renamed. Passing this flag makes the script actually rename the files and directories inside PATH.
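The options above map naturally onto Python's argparse module. The following is only a sketch of how such a command line could be declared, not the actual code of utf8.py: the function names and the destination names (path, threshold, log, rename) are my own assumptions, and it is written for Python 3 rather than the 2.7 used by the original script.

```python
import argparse


def confidence(value):
    """Parse a confidence threshold and check that it lies in [0, 1]."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError("threshold must be in the range [0..1]")
    return number


def build_parser():
    # Hypothetical reconstruction of the command line; the real utf8.py may differ.
    parser = argparse.ArgumentParser(prog="utf8.py")
    parser.add_argument("-d", dest="path", metavar="PATH", required=True,
                        help="directory to analyse")
    parser.add_argument("-t", dest="threshold", metavar="THRESHOLD",
                        type=confidence, default=0.8,
                        help="minimal Chardet confidence required to rename")
    parser.add_argument("-l", dest="log", metavar="LOG", default="utf8-log.txt",
                        help="path of the logfile")
    parser.add_argument("-r", dest="rename", action="store_true",
                        help="actually rename; without it the run is a dry-run")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args(["-d", "/home/data", "-t", "0.95", "-r"])
print(args.path, args.threshold, args.log, args.rename)
```

Validating the threshold in a small type function keeps out-of-range values such as 1.5 from ever reaching the renaming logic.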
This command will analyse the directory /home/data and write a log file to /tmp/analysis.log, reporting any non-UTF-8 charset detected with the default minimal confidence of 0.8. No file or directory renaming will take place, so the directory /home/data will remain unchanged:
./utf8.py -d /home/data -l /tmp/analysis.log
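To give an idea of what such a dry-run involves, here is a minimal sketch that walks a directory tree and reports the entries whose names are not valid UTF-8, without touching anything. It is an illustration under my own assumptions (the helper names is_utf8 and find_non_utf8 are hypothetical, and it targets Python 3); the real script additionally relies on Chardet and convmv.

```python
import os


def is_utf8(name):
    """Return True if the raw filename bytes are valid UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def find_non_utf8(root):
    """Yield (directory, name) pairs, as bytes, for names that are not valid UTF-8.

    Nothing is renamed here, so this corresponds to a dry-run.
    """
    # Passing bytes to os.walk makes it return raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_utf8(name):
                yield dirpath, name
```

Working on bytes rather than on decoded strings matters here: the whole point is to look at the names exactly as they are stored on disk.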
This command will rename every file and directory under /home/data whose detected charset has a confidence of at least 0.95; the rest will not be renamed:
./utf8.py -d /home/data -l /tmp/renamed.log -t 0.95 -r
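The threshold check can be sketched as follows. This is a hedged illustration, not the script's implementation: the function utf8_name is hypothetical, the detection argument is assumed to be a dict shaped like the result of chardet.detect() (with "encoding" and "confidence" keys), and the re-encoding is done in plain Python 3 here, whereas the original script uses convmv for the actual renaming.

```python
def utf8_name(raw_name, detection, threshold=0.8):
    """Return raw_name re-encoded as UTF-8, or None if it should be left alone.

    detection is assumed to look like the dict returned by chardet.detect().
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding is None or confidence < threshold:
        return None  # below the threshold: leave the name untouched
    try:
        return raw_name.decode(encoding).encode("utf-8")
    except (UnicodeDecodeError, LookupError):
        # The charset is unknown to Python or does not fit the bytes.
        return None


raw = b"caf\xe9.txt"  # "café.txt" encoded as Latin-1
guess = {"encoding": "ISO-8859-1", "confidence": 0.97}  # made-up values
print(utf8_name(raw, guess, threshold=0.95))
```

With a confidence of 0.97 and a threshold of 0.95 the name is converted; the same guess with a confidence of 0.90 would return None and the entry would be skipped.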
Download the script
You can get the latest version of this script right here.