Thursday, October 15, 2015

Java - Finding Frequent Phrases in a Large File Effeciently

October 15, 2015 - I was bored so I went to play around with Java and solve some simple problem or exercise below to energize my playful mind.

Given a large file that does not fit in memory (say 10GB), find the top 100000 most frequent phrases. The file has 50 phrases per line separated by a pipe (|). Assume that the phrases do not contain pipe.

Example line may look like: 

Foobar Candy | Olympics 2012 | PGA | CNET | Microsoft Bing ….
The above line has 5 phrases in visible region.

Solution:

{actual method}

    public List<Map.Entry<String,Integer>> findTopHundredThousandFrequentPhrases(String folderPath, String fileName) {
        Map<String, Integer> occurrences = new HashMap<>();

        Path path = Paths.get(folderPath, fileName);
        try (BufferedReader br = Files.newBufferedReader(path, Charset.forName("UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                String[] phrases = line.split("\\|");
                for (String phrase: phrases) {
                    if (!occurrences.containsKey(phrase)) {
                        occurrences.put(phrase, 1);
                    } else {
                        occurrences.put(phrase, occurrences.get(phrase) + 1);
                    }
                }
            }
        } catch (IOException ioe) {
            ioe.printStackTrace();
            return Collections.emptyList();
        }

        List<Map.Entry<String, Integer>> hashMapEntries = new ArrayList<>(occurrences.entrySet());
        Collections.sort(hashMapEntries, (e1, e2) -> Integer.compare(e2.getValue(), e1.getValue()));

        return hashMapEntries
                .stream()
                .limit(100000).collect(Collectors.toList());
    }

{support class}

public class ComplementaryPair {
    public int addendOne;
    public int addendTwo;

    public ComplementaryPair(int addendOne, int addendTwo) {
        this.addendOne = addendOne;
        this.addendTwo = addendTwo;
    }

    public int sum() {
        return addendOne + addendTwo;
    }

    @Override
    public String toString() {
        return addendOne + " + " + addendTwo;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;

        ComplementaryPair that = (ComplementaryPair) o;

        Set<Integer> numbers = new HashSet<>(
                Arrays.asList(
                    that.addendOne, that.addendTwo, addendOne, addendTwo
                ));

        return numbers.size() <= 2;

    }

    @Override
    public int hashCode() {
        int result = addendOne + addendTwo;
        result = 31 * result;
        return result;
    }
}

I may be wrong but hopefully I got this right. Feel free to comment or let me know your thoughts.

1 comment:

  1. Hi Paul,
    BUT what about saving occurrences Map with a bulk of 10 GB of data in the Memory ?

    ReplyDelete