This post is obsolete; please head over to this blog instead.
I've grown tired of converting every file in my project from Shift-JIS to UTF-8 by hand, so I wanted a script to do the job for me automatically. After searching the Internet, I found that it is an easy job with Python.
Python (version 2.6 at least) ships with a library called codecs, and all we have to do is use this library to read and write files in different encodings.
The script below converts all matching files, including files in sub-folders, from Shift-JIS (or whatever encoding is detected) to UTF-8.
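As a minimal sketch of the codecs approach for a single file, assuming the source encoding is already known: the function name convert_file and its parameters are hypothetical and chosen only for illustration.

```python
import codecs


def convert_file(src_path, dst_path, from_encoding='shift_jis', to_encoding='utf-8'):
    """Re-encode one file (hypothetical helper, not part of the script on this page)."""
    # codecs.open decodes the raw bytes into text while reading.
    with codecs.open(src_path, 'r', encoding=from_encoding) as src:
        text = src.read()
    # Writing through codecs.open encodes the text into the target encoding.
    with codecs.open(dst_path, 'w', encoding=to_encoding) as dst:
        dst.write(text)
```

Calling convert_file('a.txt', 'a.utf8.txt') would read a.txt as Shift-JIS and write a UTF-8 copy. Writing to a separate destination path keeps the original untouched if decoding fails.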
Install chardet first.
pip install chardet
Copy this script into the folder whose files you want to convert, then run it.
#!/usr/bin/python
import os
import re
import sys
import chardet

# Created by Leon on March, 5, 2011
# Translate all files in current folder to utf-8 encoding

file_pattern = r'^.*\.(h|m|mm|cpp|inl|def|txt|js|html?|c|py|css)$'
to_encoding = 'utf-8'

def transcode(file_name):
    # Backup
    bk_file = file_name + '.bk'
    fi = open(file_name)
    fo = open(bk_file, 'w')
    fo.write(fi.read())
    fo.close()
    fi.close()
    # Trans
    fin = open(bk_file)
    succeed = True
    try:
        data = fin.read()
        c = chardet.detect(data)
        if c is None or c['confidence'] < 0.618:
            raise Exception
        if c['encoding'] != to_encoding:
            if c['encoding'] in ('GB2312', 'GBK'):
                c['encoding'] = 'GB18030'
            print file_name + ': ' + c['encoding'] + ' ==> ' + to_encoding
            data = unicode(data, encoding=c['encoding']).encode(to_encoding)
            fout = open(file_name, 'w')
            fout.write(data)
            fout.close()
    except:
        succeed = False
        print file_name + '\'s encoding not known.'
    fin.close()
    os.remove(bk_file)
    return succeed

path = os.path.abspath(os.path.dirname(sys.argv[0]))
print "Current Path: " + path
errors = []
for dirpath, dirs, files in os.walk(path):
    for filename in files:
        if re.search(file_pattern, filename) and filename != __file__:
            print filename + ' ... '
            if not transcode(os.path.join(dirpath, filename)):
                errors.append(filename)
if errors:
    print "--------------------------------------------------------"
    print "These files got error:"
    for err in errors:
        print err
    print "--------------------------------------------------------"
else:
    print
    print "All files have been translated successfully."
print
print "Created for you by Leon on March, 5, 2011."
raw_input()
In fact, the script detects the encoding of each file, so it converts any file that is not already UTF-8 (Shift-JIS, GBK, GB2312, CP936, or, trivially, ASCII, etc.) to UTF-8.
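The detection step can be sketched on its own roughly as follows. The helper names normalize_detection and detect_encoding are hypothetical; the 0.618 confidence threshold and the GB2312/GBK to GB18030 substitution are taken from the script. chardet is imported lazily so the pure helper can be tried without the package installed.

```python
def normalize_detection(result, min_confidence=0.618):
    """Turn a chardet-style result dict into a codec name, or None if unusable.

    Hypothetical helper: `result` is expected to look like
    {'encoding': 'SHIFT_JIS', 'confidence': 0.99}.
    """
    if not result or not result.get('encoding'):
        return None
    if result.get('confidence', 0.0) < min_confidence:
        return None
    encoding = result['encoding']
    # GB18030 is a superset of GB2312 and GBK, so decoding with it is safer.
    if encoding.upper() in ('GB2312', 'GBK'):
        return 'GB18030'
    return encoding


def detect_encoding(raw_bytes):
    """Guess the codec for raw bytes using chardet (third-party package)."""
    import chardet  # imported here so normalize_detection works without it
    return normalize_detection(chardet.detect(raw_bytes))
```

With the codec name in hand, the conversion itself would be something like raw_bytes.decode(codec).encode('utf-8'); a return value of None means the guess was too weak and the file is better left alone.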