Last Updated on 2022-09-19 by Clay
題目
Given a list paths
of directory info, including the directory path, and all the files with contents in this directory, return all the duplicate files in the file system in terms of their paths. You may return the answer in any order.
A group of duplicate files consists of at least two files that have the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
It means there are n
files (f1.txt, f2.txt ... fn.txt)
with content (f1_content, f2_content ... fn_content)
respectively in the directory "root/d1/d2/.../dm"
. Note that n >= 1
and m >= 0
. If m = 0
, it means the directory is just the root directory.
The output is a list of groups of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
Example 1:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)","root 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]
Example 2:
Input: paths = ["root/a 1.txt(abcd) 2.txt(efgh)","root/c 3.txt(abcd)","root/c/d 4.txt(efgh)"] Output: [["root/a/2.txt","root/c/d/4.txt"],["root/a/1.txt","root/c/3.txt"]]
這是難得一見『倒讚』比『讚』來得多的題目。不過除了題目看起來沒什麼實用性外,我沒能很好地理解這題的不堪之處。
簡單來講,我們會拿到多個文件路徑 paths
,文件路徑的表示方法如下:
<directory_path> <f1.txt>(<f1_content>) <f2.txt>(<f2_content>)... <fn.txt>(<fn_content>)
其中括號中的,就是文件的內容(content)。我們要返回多組字串陣列,同一組陣列中的字串儲存重複內容的文件路徑。
比方說 Example 1,由於
root/a/2.txt
root/c/d/4.txt
root/4.txt
三份文件的內容都是 efgh,所以需要回傳他們三份文件路徑的陣列。
如果文件沒有和其他文件重複內容,則無須回傳。
解題紀錄
我很自然地使用暴力法解決...... 不得不說這種文字處理 Python 真的比 C++ 好寫太多了。
簡單來講,就是我在遇到第一個空白前,我把所有字元儲存為當前路徑。
接下來開始儲存文件名稱,直到遇到左括號 (
,這才轉為儲存文件內容。遇到右括號 )
後,則又重新開始儲存文件名稱。
同樣內容的文件路徑,我是用一個字典把他們儲存起來。key 是文件內容,value 是儲存各個文件路徑的字串陣列。
最後,只要字典中的值(值是陣列)有超過兩個以上的文件路徑,則將其儲存在即將要回傳的答案陣列中。
複雜度
Time complexity | O(n) |
Space complexity | O(n) |
C++ 範例程式碼
class Solution {
public:
vector<vector<string>> findDuplicate(vector<string>& paths) {
// Init
int index = 0;
unordered_map<string, vector<string>> mp;
vector<vector<string>> results;
// Find
for (auto& path: paths) {
int i = 0;
bool isFilepath = true;
string _path;
string filepath;
string content;
// Combine path
while (path[i] != ' ') {
_path.push_back(path[i]);
++i;
}
_path.push_back('/');
// Combine filepath and content <FILE_PATH>(<CONTENT>)
for (int j=i+1; j<path.size(); ++j) {
char c = path[j];
if (c == ' ') {
continue;
}
// File path
if (isFilepath) {
if (c != '(') {
filepath.push_back(c);
}
else {
isFilepath = false;
}
}
// Content
else {
if (c != ')') {
content.push_back(c);
}
else {
if (!mp.count(content)) {
mp[content] = vector<string>();
}
mp[content].push_back(_path+filepath);
isFilepath = true;
filepath.clear();
content.clear();
}
}
}
}
// Combined results
for (auto& it: mp) {
if (it.second.size() > 1) {
results.push_back(it.second);
}
}
return results;
}
};
Python 範例程式碼
class Solution:
def findDuplicate(self, paths: List[str]) -> List[List[str]]:
# Init
mp = dict()
for path in paths:
data = path.split(" ")
_path, file_info_group = data[0], data[1:]
for info in file_info_group:
filename, content = info.split("(")
content = content[:-1]
mp[content] = mp.get(content, []) + [os.path.join(_path, filename)]
return [mp[content] for content in mp if len(mp[content])>1]