[Cover]

Machine learning algorithms for big data = Αλγόριθμοι μάθησης για μεγάλης κλίμακας δεδομένα.

Elisavet Tsolakidou

Abstract


The purpose of this thesis is to present both a comprehensive study and a practical example of the methods and tools used for word and document embeddings. Embeddings, or vector representations, are the necessary first step before natural language data can be fed into a neural network for processing. Vectors whose parameters capture real-world properties of the corresponding words or documents have proven pivotal to the success of natural language processing tasks such as classification, summarization, and translation, among others. Following a detailed presentation of the most popular methods, experiments are implemented in the Python programming language on Greek-language text, and their results are discussed.

[Abstract in Greek] The purpose of this thesis is to present the methods and tools used for the vector representation of words and documents. Vector representation of words is the necessary first step for feeding natural language data into any neural network in order to process the words and extract models. Representing words as vectors whose parameters capture the breadth of their properties has been observed to be decisive for the success of natural language processing tasks such as text classification, summarization, and translation, among others. First, the best-known techniques for representing words and documents are presented; then, using the Python programming language, the representation of words and documents in the Greek language is studied.
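To make the idea of a vector representation concrete, the following is a minimal, self-contained sketch of one classic method of this kind, TF-IDF weighting, applied to Greek text. The toy corpus, the whitespace tokenization, and all variable names here are illustrative assumptions for this sketch, not the thesis's actual data or code.

```python
import math
from collections import Counter

# Toy Greek corpus; a real experiment would use a large document collection.
docs = [
    "η γάτα κάθεται στο χαλί",
    "ο σκύλος κάθεται στο χαλί",
    "η γάτα κυνηγά το ποντίκι",
]

# Naive whitespace tokenization (illustrative only).
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(w for doc in tokenized for w in set(doc))
# idf(t) = log(N / df(t)): rarer words get higher weight.
idf = {w: math.log(n_docs / df[w]) for w in vocab}

def tfidf_vector(tokens):
    """Map a tokenized document to a TF-IDF vector over the shared vocabulary."""
    tf = Counter(tokens)
    return [tf[w] / len(tokens) * idf[w] for w in vocab]

vectors = [tfidf_vector(doc) for doc in tokenized]
```

Each document is thereby turned into a fixed-length numeric vector (one coordinate per vocabulary word) that can be fed to a classifier or neural network; learned embeddings such as word2vec, GloVe, or fastText replace these sparse counts with dense vectors trained from context.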



