The utf8.py script
Some months ago I ran into an annoying issue that affected the ownCloud client during folder synchronization. As a result, I wrote a trivial Python script that helped me rename files with non-UTF-8 names using the UTF-8 encoding. Today I had to deal with the very same issue, so I decided to add some functionality to the original script.
Pre-requisites
This script has been written in Python 2.7. This is what you will need in order to execute the script:
- Python 2.7.
- The convmv utility (# apt-get install convmv).
- The Python Chardet module (# apt-get install python-chardet).
- The script itself, utf8.py.
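Before running the script, it can help to confirm that the two external prerequisites are actually available. The following is a minimal sketch, not part of utf8.py: the function name missing_prerequisites is hypothetical, and the snippet is written for Python 3 even though the script itself targets Python 2.7.

```python
import importlib.util
import shutil


def missing_prerequisites():
    """Return the names of the prerequisites that cannot be found."""
    missing = []
    # shutil.which returns None when the executable is not on the PATH.
    if shutil.which("convmv") is None:
        missing.append("convmv")
    # find_spec returns None when the module is not importable.
    if importlib.util.find_spec("chardet") is None:
        missing.append("chardet")
    return missing


print(missing_prerequisites() or "all prerequisites found")
```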
Using the script
./utf8.py -d PATH [-t THRESHOLD] [-l LOG] [-r]
-d PATH
The directory to analyse and, if the -r flag is given, to fix (i.e., all the files and directories inside the PATH directory will be renamed according to the UTF-8 encoding standard).
-t THRESHOLD
The Chardet module reports a “confidence” value for each charset it detects, indicating how certain the detection is. The -t flag sets the minimum confidence a detected charset must reach before the script attempts to rename the file or directory using UTF-8. This is a numerical value in the range [0..1]. Default value: 0.8.
-l LOG
By default, the script creates a logfile called utf8-log.txt in the directory where it is executed. This flag lets you choose the logfile's location and name.
-r
By default, the utf8.py script performs a dry run; i.e., the files and directories of PATH will not be renamed. When this flag is passed, the files and directories inside PATH are renamed.
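The options above could be expressed with argparse roughly as follows. This is a hypothetical reconstruction, not the actual source of utf8.py: only the flag names, their meanings and the defaults come from the description above, while the function names and the dest names (path, threshold, log, rename) are assumptions made for illustration.

```python
import argparse


def confidence(value):
    """Parse THRESHOLD and reject values outside the range [0, 1]."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError("threshold must be between 0 and 1")
    return number


def build_parser():
    # Hypothetical parser mirroring the documented flags and defaults.
    parser = argparse.ArgumentParser(prog="utf8.py")
    parser.add_argument("-d", dest="path", required=True,
                        help="directory to analyse")
    parser.add_argument("-t", dest="threshold", type=confidence, default=0.8,
                        help="minimum Chardet confidence, in [0, 1]")
    parser.add_argument("-l", dest="log", default="utf8-log.txt",
                        help="path of the logfile")
    parser.add_argument("-r", dest="rename", action="store_true",
                        help="actually rename; without it, dry run only")
    return parser


args = build_parser().parse_args(["-d", "/home/data", "-t", "0.95", "-r"])
print(args.path, args.threshold, args.log, args.rename)
```

Using store_true for -r keeps the dry run as the default, so renaming only happens when it is requested explicitly.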
Examples
This command will generate a log file at /tmp/analysis.log for the directory /home/data, detecting any non-UTF8 charset with the default confidence threshold of 0.8. No file or directory renaming will take place, so the directory /home/data will remain unchanged:
./utf8.py -d /home/data -l /tmp/analysis.log
This command will rename any file or directory under /home/data whose detected charset has a confidence of at least 0.95; the rest will not be renamed:
./utf8.py -d /home/data -l /tmp/renamed.log -t 0.95 -r
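To make the two examples more concrete, here is a minimal sketch of the analysis step and the confidence check. It is not the code of utf8.py: the function names are hypothetical, it is written for Python 3, and it assumes the detection result is a dictionary with "encoding" and "confidence" keys, which is the shape returned by chardet.detect. The actual renaming, which the script delegates to convmv, is left out.

```python
import os


def is_utf8(name):
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def should_rename(detection, threshold=0.8):
    """Apply the confidence threshold to a chardet-style result dict."""
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    return encoding is not None and confidence >= threshold


def find_non_utf8(root):
    """Dry run: list entries under root whose names are not valid UTF-8."""
    found = []
    # Passing bytes makes os.walk yield raw, undecoded names.
    for dirpath, dirnames, filenames in os.walk(os.fsencode(root)):
        for name in dirnames + filenames:
            if not is_utf8(name):
                found.append(os.path.join(dirpath, name))
    return found


# A detection with confidence 0.73 fails the default threshold of 0.8
# but passes a lower threshold of 0.7.
print(should_rename({"encoding": "ISO-8859-1", "confidence": 0.73}))
print(should_rename({"encoding": "ISO-8859-1", "confidence": 0.73}, 0.7))
```

With the threshold of 0.95 from the second example, only names whose detected charset reaches that confidence would pass should_rename; everything else would be logged and left alone.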
Download the script
You can get the latest version for this script right here.